
Table of Contents
Introduction 1.1
Spark Structured Streaming and Streaming Queries 1.2
Batch Processing Time 1.2.1
Internals of Streaming Queries 1.3

Streaming Join
Streaming Join 2.1
StateStoreAwareZipPartitionsRDD 2.2
SymmetricHashJoinStateManager 2.3
StateStoreHandler 2.3.1
KeyToNumValuesStore 2.3.2
KeyWithIndexToValueStore 2.3.3
OneSideHashJoiner 2.4
JoinStateWatermarkPredicates 2.5
JoinStateWatermarkPredicate 2.5.1
StateStoreAwareZipPartitionsHelper 2.6
StreamingSymmetricHashJoinHelper 2.7
StreamingJoinHelper 2.8

Extending Structured Streaming with New Data Sources
Extending Structured Streaming with New Data Sources 3.1
BaseStreamingSource 3.2
BaseStreamingSink 3.3
StreamWriteSupport 3.4
StreamWriter 3.5
DataSource 3.6

Demos
Demos 4.1
Internals of FlatMapGroupsWithStateExec Physical Operator 4.2
Arbitrary Stateful Streaming Aggregation with KeyValueGroupedDataset.flatMapGroupsWithState Operator 4.3
Exploring Checkpointed State 4.4
Streaming Watermark with Aggregation in Append Output Mode 4.5
Streaming Query for Running Counts (Socket Source and Complete Output Mode) 4.6
Streaming Aggregation with Kafka Data Source 4.7
groupByKey Streaming Aggregation in Update Mode 4.8
StateStoreSaveExec with Complete Output Mode 4.9
StateStoreSaveExec with Update Output Mode 4.10
Developing Custom Streaming Sink (and Monitoring SQL Queries in web UI) 4.11
current_timestamp Function For Processing Time in Streaming Queries 4.12
Using StreamingQueryManager for Query Termination Management 4.13

Streaming Aggregation
Streaming Aggregation 5.1
StateStoreRDD 5.2
StateStoreOps 5.2.1
StreamingAggregationStateManager 5.3
StreamingAggregationStateManagerBaseImpl 5.3.1
StreamingAggregationStateManagerImplV1 5.3.2
StreamingAggregationStateManagerImplV2 5.3.3

Stateful Stream Processing


Stateful Stream Processing 6.1
Streaming Watermark 6.2
Streaming Deduplication 6.3
Streaming Limit 6.4
StateStore 6.5

StateStoreId 6.5.1
HDFSBackedStateStore 6.5.2
StateStoreProvider 6.6
StateStoreProviderId 6.6.1
HDFSBackedStateStoreProvider 6.6.2
StateStoreCoordinator 6.7
StateStoreCoordinatorRef 6.7.1
WatermarkSupport 6.8
StatefulOperator 6.9
StateStoreReader 6.9.1
StateStoreWriter 6.9.2
StatefulOperatorStateInfo 6.10
StateStoreMetrics 6.11
StateStoreCustomMetric 6.12
StateStoreUpdater 6.13
EventTimeStatsAccum 6.14
StateStoreConf 6.15

Arbitrary Stateful Streaming Aggregation


Arbitrary Stateful Streaming Aggregation 7.1
GroupState 7.2
GroupStateImpl 7.2.1
GroupStateTimeout 7.3
StateManager 7.4
StateManagerImplV2 7.4.1
StateManagerImplBase 7.4.2
StateManagerImplV1 7.4.3
FlatMapGroupsWithStateExecHelper Helper Class 7.5
InputProcessor Helper Class of FlatMapGroupsWithStateExec Physical Operator 7.6

Developing Streaming Applications


DataStreamReader 8.1

DataStreamWriter 8.2
OutputMode 8.2.1
Trigger 8.2.2
StreamingQuery 8.3
Streaming Operators 8.4
dropDuplicates Operator 8.4.1
explain Operator 8.4.2
groupBy Operator 8.4.3
groupByKey Operator 8.4.4
withWatermark Operator 8.4.5
window Function 8.5
KeyValueGroupedDataset 8.6
mapGroupsWithState Operator 8.6.1
flatMapGroupsWithState Operator 8.6.2
StreamingQueryManager 8.7
SQLConf 8.8
Configuration Properties 8.9

Monitoring of Streaming Query Execution


StreamingQueryListener 9.1
ProgressReporter 9.2
StreamingQueryProgress 9.3
ExecutionStats 9.3.1
SourceProgress 9.3.2
SinkProgress 9.3.3
StreamingQueryStatus 9.4
MetricsReporter 9.5
Web UI 9.6
Logging 9.7

File-Based Data Source


FileStreamSource 10.1

FileStreamSink 10.2
FileStreamSinkLog 10.3
SinkFileStatus 10.4
ManifestFileCommitProtocol 10.5
MetadataLogFileIndex 10.6

Kafka Data Source


Kafka Data Source 11.1
KafkaSourceProvider 11.2
KafkaSource 11.3
KafkaRelation 11.4
KafkaSourceRDD 11.5
CachedKafkaConsumer 11.6
KafkaSourceOffset 11.7
KafkaOffsetReader 11.8
ConsumerStrategy 11.9
KafkaSink 11.10
KafkaOffsetRangeLimit 11.11
KafkaDataConsumer 11.12
KafkaMicroBatchReader 11.13
KafkaOffsetRangeCalculator 11.13.1
KafkaMicroBatchInputPartition 11.13.2
KafkaMicroBatchInputPartitionReader 11.13.3
KafkaSourceInitialOffsetWriter 11.13.4
KafkaContinuousReader 11.14
KafkaContinuousInputPartition 11.14.1

Text Socket Data Source


TextSocketSourceProvider 12.1
TextSocketSource 12.2

Rate Data Source
RateSourceProvider 13.1
RateStreamSource 13.2
RateStreamMicroBatchReader 13.3

Console Data Sink


ConsoleSinkProvider 14.1
ConsoleWriter 14.2

Foreach Data Sink


ForeachWriterProvider 15.1
ForeachWriter 15.2
ForeachSink 15.3

ForeachBatch Data Sink


ForeachBatchSink 16.1

Memory Data Source


Memory Data Source 17.1
MemoryStream 17.2
ContinuousMemoryStream 17.3
MemorySink 17.4
MemorySinkV2 17.5
MemoryStreamWriter 17.5.1
MemoryStreamBase 17.6
MemorySinkBase 17.7

Offsets and Metadata Checkpointing (Fault-Tolerance and Reliability)
Offsets and Metadata Checkpointing 18.1
MetadataLog 18.2
HDFSMetadataLog 18.3
CommitLog 18.4
CommitMetadata 18.4.1
OffsetSeqLog 18.5
OffsetSeq 18.5.1
CompactibleFileStreamLog 18.6
FileStreamSourceLog 18.6.1
OffsetSeqMetadata 18.7
CheckpointFileManager 18.8
FileContextBasedCheckpointFileManager 18.8.1
FileSystemBasedCheckpointFileManager 18.8.2
Offset 18.9
StreamProgress 18.10

Micro-Batch Stream Processing (Structured Streaming V1)
Micro-Batch Stream Processing 19.1
MicroBatchExecution 19.2
MicroBatchWriter 19.2.1
MicroBatchReadSupport 19.3
MicroBatchReader 19.3.1
WatermarkTracker 19.4
Source 19.5
StreamSourceProvider 19.5.1
Sink 19.6
StreamSinkProvider 19.6.1

Continuous Stream Processing (Structured Streaming V2)
Continuous Stream Processing 20.1
ContinuousExecution 20.2
ContinuousReadSupport Contract 20.3
ContinuousReader Contract 20.4
RateStreamContinuousReader 20.5
EpochCoordinator RPC Endpoint 20.6
EpochCoordinatorRef 20.6.1
EpochTracker 20.6.2
ContinuousQueuedDataReader 20.7
DataReaderThread 20.7.1
EpochMarkerGenerator 20.7.2
PartitionOffset 20.8
ContinuousExecutionRelation Leaf Logical Operator 20.9
WriteToContinuousDataSource Unary Logical Operator 20.10
WriteToContinuousDataSourceExec Unary Physical Operator 20.11
ContinuousWriteRDD 20.11.1
ContinuousDataSourceRDD 20.12

Query Planning and Execution


StreamExecution 21.1
StreamingQueryWrapper 21.1.1
TriggerExecutor 21.2
IncrementalExecution 21.3
StreamingQueryListenerBus 21.4
StreamMetadata 21.5

Logical Operators
EventTimeWatermark Unary Logical Operator 22.1
FlatMapGroupsWithState Unary Logical Operator 22.2

Deduplicate Unary Logical Operator 22.3
MemoryPlan Logical Query Plan 22.4
StreamingRelation Leaf Logical Operator for Streaming Source 22.5
StreamingRelationV2 Leaf Logical Operator 22.6
StreamingExecutionRelation Leaf Logical Operator for Streaming Source At Execution 22.7

Physical Operators
EventTimeWatermarkExec 23.1
FlatMapGroupsWithStateExec 23.2
StateStoreRestoreExec 23.3
StateStoreSaveExec 23.4
StreamingDeduplicateExec 23.5
StreamingGlobalLimitExec 23.6
StreamingRelationExec 23.7
StreamingSymmetricHashJoinExec 23.8

Execution Planning Strategies


FlatMapGroupsWithStateStrategy 24.1
StatefulAggregationStrategy 24.2
StreamingDeduplicationStrategy 24.3
StreamingGlobalLimitStrategy 24.4
StreamingJoinStrategy 24.5
StreamingRelationStrategy 24.6

Varia
UnsupportedOperationChecker 25.1

Introduction

The Internals of Spark Structured Streaming (Apache Spark 2.4.4)
Welcome to The Internals of Spark Structured Streaming gitbook! I’m very excited to
have you here and hope you will enjoy exploring the internals of Spark Structured Streaming
as much as I have.

I write to discover what I know.

— Flannery O'Connor
I’m Jacek Laskowski, a freelance IT consultant, software engineer and technical instructor
specializing in Apache Spark, Apache Kafka and Kafka Streams (with Scala and sbt).

I offer software development and consultancy services with hands-on in-depth workshops
and mentoring. Reach out to me at [email protected] or @jaceklaskowski to discuss
opportunities.

Consider joining me at Warsaw Scala Enthusiasts and Warsaw Spark meetups in Warsaw,
Poland.

Tip: I’m also writing other books in the "The Internals of" series about Apache Spark, Spark SQL, Apache Kafka, and Kafka Streams.

Expect text and code snippets from a variety of public sources. Attribution follows.

Now, let me introduce you to Spark Structured Streaming and Streaming Queries.


Spark Structured Streaming and Streaming Queries
Spark Structured Streaming (aka Structured Streaming or Spark Streams) is the module of
Apache Spark for stream processing using streaming queries.

Streaming queries can be expressed using a high-level declarative streaming API (Dataset
API) or good ol' SQL (SQL over stream / streaming SQL). The declarative streaming Dataset
API and SQL are executed on the underlying highly-optimized Spark SQL engine.

The semantics of the Structured Streaming model is as follows (see the article Structured
Streaming In Apache Spark):

At any time, the output of a continuous application is equivalent to executing a batch job
on a prefix of the data.

Note: As of Spark 2.2.0, Structured Streaming has been marked stable and ready for production use. With that, the other, older streaming module, Spark Streaming, is considered obsolete and not used for developing new streaming applications with Apache Spark.

Spark Structured Streaming comes with two stream execution engines for executing
streaming queries:

MicroBatchExecution for Micro-Batch Stream Processing

ContinuousExecution for Continuous Stream Processing
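Which of the two engines runs a query is decided by the trigger you pass at start time. The following is a minimal sketch of my own (assuming a local SparkSession, the rate source and the console sink), not an excerpt from Spark itself:

import org.apache.spark.sql.streaming.Trigger

val rates = spark.readStream.format("rate").load

// a processing-time trigger runs the query with MicroBatchExecution
val microBatchQuery = rates.writeStream
  .format("console")
  .trigger(Trigger.ProcessingTime("10 seconds"))
  .start()

// a continuous trigger runs the query with ContinuousExecution
val continuousQuery = rates.writeStream
  .format("console")
  .trigger(Trigger.Continuous("1 second"))
  .start()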

The goal of Spark Structured Streaming is to unify streaming, interactive, and batch queries
over structured datasets for developing end-to-end stream processing applications dubbed
continuous applications using Spark SQL’s Datasets API with additional support for the
following features:

Streaming Aggregation

Streaming Join

Streaming Watermark

Arbitrary Stateful Streaming Aggregation

Stateful Stream Processing


In Structured Streaming, Spark developers describe custom streaming computations in the same way as with Spark SQL. Internally, Structured Streaming applies the user-defined structured query to the continuously and indefinitely arriving data to analyze real-time streaming data.

Structured Streaming introduces the concept of streaming datasets that are infinite
datasets with primitives like input streaming data sources and output streaming data sinks.

A Dataset is streaming when its logical plan is streaming.

val batchQuery = spark.
  read. // <-- batch non-streaming query
  csv("sales")

assert(batchQuery.isStreaming == false)

val streamingQuery = spark.
  readStream. // <-- streaming query
  format("rate").
  load

assert(streamingQuery.isStreaming)

Tip: Read up on Spark SQL, Datasets and logical plans in The Internals of Spark SQL book.

Structured Streaming models a stream of data as an infinite (and hence continuous) table
that could be changed every streaming batch.

You can specify the output mode of a streaming Dataset, which is what gets written to a streaming sink (i.e. the infinite result table) when there is new data available.
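For illustration, a minimal sketch of my own (assuming the rate source and the console sink; the bucket column is made up) of how an output mode is specified:

import org.apache.spark.sql.streaming.OutputMode
import spark.implicits._

val counts = spark.readStream.format("rate").load
  .groupBy(($"value" % 10) as "bucket")
  .count()

// Complete writes the whole result table every trigger,
// Update writes only the rows that changed since the last trigger,
// Append writes only new rows (and, for aggregations, requires a watermark).
val query = counts.writeStream
  .format("console")
  .outputMode(OutputMode.Complete)
  .start()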

Streaming Datasets use streaming query plans (as opposed to regular batch Datasets that
are based on batch query plans).

Note: From this perspective, batch queries can be considered streaming Datasets executed once only (which is why some batch queries, e.g. KafkaSource, can easily work in batch mode).

val batchQuery = spark.read.format("rate").load

assert(batchQuery.isStreaming == false)

val streamingQuery = spark.readStream.format("rate").load

assert(streamingQuery.isStreaming)

With Structured Streaming, Spark 2 aims at simplifying streaming analytics with little to no
need to reason about effective data streaming (trying to hide the unnecessary complexity in
your streaming analytics architectures).


Structured Streaming is defined by the following data abstractions in the org.apache.spark.sql.streaming package:

StreamingQuery

Streaming Source

Streaming Sink

StreamingQueryManager

Structured Streaming follows the micro-batch model and periodically fetches data from the data source (and uses the DataFrame data abstraction to represent the fetched data for a certain batch).

With Datasets as Spark SQL’s view of structured data, structured streaming checks input
sources for new data every trigger (time) and executes the (continuous) queries.

Note: The feature has also been called Streaming Spark SQL Query, Streaming DataFrames, Continuous DataFrame or Continuous Query. There have been lots of names before the Spark project settled on Structured Streaming.

Further Reading Or Watching


SPARK-8360 Structured Streaming (aka Streaming DataFrames)

The official Structured Streaming Programming Guide

(article) Structured Streaming In Apache Spark

(video) The Future of Real Time in Spark from Spark Summit East 2016 in which
Reynold Xin presents the concept of Streaming DataFrames

(video) Structuring Spark: DataFrames, Datasets, and Streaming

(article) What Spark’s Structured Streaming really means

(video) A Deep Dive Into Structured Streaming by Tathagata "TD" Das from Spark
Summit 2016

(video) Arbitrary Stateful Aggregations in Structured Streaming in Apache Spark by


Burak Yavuz


Batch Processing Time


Batch Processing Time (aka Batch Timeout Threshold) is the processing time
(processing timestamp) of the current streaming batch.

The following standard functions (and their Catalyst expressions) allow accessing the batch
processing time in Micro-Batch Stream Processing:

now , current_timestamp , and unix_timestamp functions ( CurrentTimestamp )

current_date function ( CurrentDate )

Note: CurrentTimestamp or CurrentDate expressions are not supported in Continuous Stream Processing.
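For illustration only (a sketch of mine, assuming the rate source), current_timestamp evaluates to the batch processing time of the current micro-batch:

import org.apache.spark.sql.functions.current_timestamp

val stamped = spark.readStream
  .format("rate")
  .load
  .withColumn("batch_processing_time", current_timestamp())

assert(stamped.isStreaming)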

Internals
GroupStateImpl is given the batch processing time when created for a streaming query (that is actually the batch processing time of the FlatMapGroupsWithStateExec physical operator).

When created, FlatMapGroupsWithStateExec physical operator has the processing time undefined; it is set to the current timestamp by the state preparation rule every streaming batch.

The current timestamp (and other batch-specific configurations) is given as the OffsetSeqMetadata (as part of the query planning phase) when a stream execution engine does the following:

MicroBatchExecution is requested to construct a next streaming micro-batch in Micro-Batch Stream Processing

In Continuous Stream Processing, the base StreamExecution is requested to run stream processing and initializes OffsetSeqMetadata to 0s.


Internals of Streaming Queries


Note: The page is to keep notes about how to guide readers through the codebase and may disappear if merged with the other pages or become an intro page.

1. DataStreamReader and Streaming Data Source

2. Data Source Resolution, Streaming Dataset and Logical Query Plan

3. Dataset API — High-Level DSL to Build Logical Query Plan

4. DataStreamWriter and Streaming Data Sink

5. StreamingQuery

6. StreamingQueryManager

DataStreamReader and Streaming Data Source


It all starts with SparkSession.readStream method which lets you define a streaming source
in a stream processing pipeline (aka streaming processing graph or dataflow graph).

import org.apache.spark.sql.SparkSession
assert(spark.isInstanceOf[SparkSession])

val reader = spark.readStream

import org.apache.spark.sql.streaming.DataStreamReader
assert(reader.isInstanceOf[DataStreamReader])

SparkSession.readStream method creates a DataStreamReader.

The fluent API of DataStreamReader allows you to describe the input data source (e.g.
DataStreamReader.format and DataStreamReader.options) using method chaining (with the
goal of making the readability of the source code close to that of ordinary written prose,
essentially creating a domain-specific language within the interface. See Fluent interface
article in Wikipedia).

reader
.format("csv")
.option("delimiter", "|")


There are a couple of built-in data source formats. Their names are the names of the
corresponding DataStreamReader methods and so act like shortcuts of
DataStreamReader.format (where you have to specify the format by name), i.e. csv, json, orc,

parquet and text, followed by DataStreamReader.load.

You may also want to use DataStreamReader.schema method to specify the schema of the
streaming data source.

reader.schema("a INT, b STRING")

In the end, you use DataStreamReader.load method that simply creates a streaming Dataset
(the good ol' Dataset that you may have already used in Spark SQL).

val input = reader
  .format("csv")
  .option("delimiter", "\t")
  .schema("word STRING, num INT")
  .load("data/streaming")

import org.apache.spark.sql.DataFrame
assert(input.isInstanceOf[DataFrame])

The Dataset has the isStreaming property enabled, which is basically the only way you could distinguish streaming Datasets from regular, batch Datasets.

assert(input.isStreaming)

In other words, Spark Structured Streaming is designed to extend the features of Spark SQL
and let your structured queries be streaming queries.

Data Source Resolution, Streaming Dataset and Logical Query Plan
Being curious about the internals of streaming Datasets is where you start…​seeing numbers
not humans (sorry, couldn’t resist drawing the comparison between Matrix the movie and the
internals of Spark Structured Streaming).

Whenever you create a Dataset (be it batch in Spark SQL or streaming in Spark Structured Streaming), you create a logical query plan using the high-level Dataset DSL.

A logical query plan is made up of logical operators.


Spark Structured Streaming gives you two logical operators to represent streaming sources,
i.e. StreamingRelationV2 and StreamingRelation.

When DataStreamReader.load method is executed, load first looks up the requested data
source (that you specified using DataStreamReader.format) and creates an instance of it
(instantiation). That’d be data source resolution step (that I described in…​FIXME).

DataStreamReader.load is where you can find the intersection of the former Micro-Batch

Stream Processing V1 API with the new Continuous Stream Processing V2 API.

For MicroBatchReadSupport or ContinuousReadSupport data sources,


DataStreamReader.load creates a logical query plan with a StreamingRelationV2 leaf logical

operator. That is the new V2 code path.

StreamingRelationV2 Logical Operator for Data Source V2

// rate data source is V2
val rates = spark.readStream.format("rate").load
val plan = rates.queryExecution.logical
scala> println(plan.numberedTreeString)
00 StreamingRelationV2 org.apache.spark.sql.execution.streaming.sources.RateStreamProvider@2ed03b1a, rate, [timestamp#12, value#13L]

For all other types of streaming data sources, DataStreamReader.load creates a logical query
plan with a StreamingRelation leaf logical operator. That is the former V1 code path.

StreamingRelation Logical Operator for Data Source V1

// text data source is V1
val texts = spark.readStream.format("text").load("data/streaming")
val plan = texts.queryExecution.logical
scala> println(plan.numberedTreeString)
00 StreamingRelation DataSource(org.apache.spark.sql.SparkSession@35edd886,text,List(),None,List(),None,Map(path -> data/streaming),None), FileSource[data/streaming], [value#18]

Dataset API — High-Level DSL to Build Logical Query Plan


With a streaming Dataset created, you can now use all the methods of Dataset API,
including but not limited to the following operators:

Dataset.dropDuplicates for streaming deduplication

Dataset.groupBy and Dataset.groupByKey for streaming aggregation

Dataset.withWatermark for event time watermark


Please note that a streaming Dataset is a regular Dataset (with some streaming-related
limitations).

val rates = spark
  .readStream
  .format("rate")
  .load
val countByTime = rates
  .withWatermark("timestamp", "10 seconds")
  .groupBy($"timestamp")
  .agg(count("*") as "count")

import org.apache.spark.sql.Dataset
assert(countByTime.isInstanceOf[Dataset[_]])

The point is to understand that the Dataset API is a domain-specific language (DSL) to build
a more sophisticated stream processing pipeline that you could also build using the low-level
logical operators directly.

Use Dataset.explain to learn the underlying logical and physical query plans.


assert(countByTime.isStreaming)

scala> countByTime.explain(extended = true)


== Parsed Logical Plan ==
'Aggregate ['timestamp], [unresolvedalias('timestamp, None), count(1) AS count#131L]
+- EventTimeWatermark timestamp#88: timestamp, interval 10 seconds
+- StreamingRelationV2 org.apache.spark.sql.execution.streaming.sources.RateStreamP
rovider@2fcb3082, rate, [timestamp#88, value#89L]

== Analyzed Logical Plan ==


timestamp: timestamp, count: bigint
Aggregate [timestamp#88-T10000ms], [timestamp#88-T10000ms, count(1) AS count#131L]
+- EventTimeWatermark timestamp#88: timestamp, interval 10 seconds
+- StreamingRelationV2 org.apache.spark.sql.execution.streaming.sources.RateStreamP
rovider@2fcb3082, rate, [timestamp#88, value#89L]

== Optimized Logical Plan ==


Aggregate [timestamp#88-T10000ms], [timestamp#88-T10000ms, count(1) AS count#131L]
+- EventTimeWatermark timestamp#88: timestamp, interval 10 seconds
+- Project [timestamp#88]
+- StreamingRelationV2 org.apache.spark.sql.execution.streaming.sources.RateStre
amProvider@2fcb3082, rate, [timestamp#88, value#89L]

== Physical Plan ==
*(5) HashAggregate(keys=[timestamp#88-T10000ms], functions=[count(1)], output=[timesta
mp#88-T10000ms, count#131L])
+- StateStoreSave [timestamp#88-T10000ms], state info [ checkpoint = <unknown>, runId
= 28606ba5-9c7f-4f1f-ae41-e28d75c4d948, opId = 0, ver = 0, numPartitions = 200], Append
, 0, 2
+- *(4) HashAggregate(keys=[timestamp#88-T10000ms], functions=[merge_count(1)], out
put=[timestamp#88-T10000ms, count#136L])
+- StateStoreRestore [timestamp#88-T10000ms], state info [ checkpoint = <unknown
>, runId = 28606ba5-9c7f-4f1f-ae41-e28d75c4d948, opId = 0, ver = 0, numPartitions = 200
], 2
+- *(3) HashAggregate(keys=[timestamp#88-T10000ms], functions=[merge_count(1)
], output=[timestamp#88-T10000ms, count#136L])
+- Exchange hashpartitioning(timestamp#88-T10000ms, 200)
+- *(2) HashAggregate(keys=[timestamp#88-T10000ms], functions=[partial_
count(1)], output=[timestamp#88-T10000ms, count#136L])
+- EventTimeWatermark timestamp#88: timestamp, interval 10 seconds
+- *(1) Project [timestamp#88]
+- StreamingRelation rate, [timestamp#88, value#89L]

Or go pro and talk to QueryExecution directly.


val plan = countByTime.queryExecution.logical

scala> println(plan.numberedTreeString)
00 'Aggregate ['timestamp], [unresolvedalias('timestamp, None), count(1) AS count#131L]
01 +- EventTimeWatermark timestamp#88: timestamp, interval 10 seconds
02    +- StreamingRelationV2 org.apache.spark.sql.execution.streaming.sources.RateStreamProvider@2fcb3082, rate, [timestamp#88, value#89L]

Please note that you may have already used most of the stream processing operators in batch structured queries in Spark SQL. Again, the distinction between Spark SQL and Spark Structured Streaming is very thin from a developer’s point of view.

DataStreamWriter and Streaming Data Sink


Once you’re satisfied with building a stream processing pipeline (using the APIs of
DataStreamReader, Dataset, RelationalGroupedDataset and KeyValueGroupedDataset ), you
should define how and when the result of the streaming query is persisted in (sent out to) an
external data system using a streaming sink.

Tip: Read up on the APIs of Dataset, RelationalGroupedDataset and KeyValueGroupedDataset in The Internals of Spark SQL book.

You should use Dataset.writeStream method that simply creates a DataStreamWriter.

// Not only is this a Dataset, but it is also streaming
assert(countByTime.isStreaming)

val writer = countByTime.writeStream

import org.apache.spark.sql.streaming.DataStreamWriter
assert(writer.isInstanceOf[DataStreamWriter[_]])

The fluent API of DataStreamWriter allows you to describe the output data sink (e.g.
DataStreamWriter.format and DataStreamWriter.options) using method chaining (with the
goal of making the readability of the source code close to that of ordinary written prose,
essentially creating a domain-specific language within the interface. See Fluent interface
article in Wikipedia).

writer
.format("csv")
.option("delimiter", "\t")


Like in DataStreamReader data source formats, there are a couple of built-in data sink
formats. Unlike data source formats, their names do not have corresponding
DataStreamWriter methods. The reason is that you will use DataStreamWriter.start to create

and immediately start a StreamingQuery.

There are however two special output formats that do have corresponding
DataStreamWriter methods, i.e. DataStreamWriter.foreach and

DataStreamWriter.foreachBatch, that allow for persisting query results to external data


systems that do not have streaming sinks available. They give you a trade-off between
developing a full-blown streaming sink and simply using the methods (that lay the basis of
what a custom sink would have to do anyway).
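For illustration, here is a hedged sketch of mine of DataStreamWriter.foreachBatch (the output path and checkpoint location are made up) that reuses an ordinary batch data source to persist every micro-batch:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.SaveMode

// each micro-batch arrives as a regular (batch) Dataset, so any batch sink works
val writeBatch: (DataFrame, Long) => Unit = (batch, batchId) =>
  batch.write.mode(SaveMode.Append).parquet(s"/tmp/counts/batch-$batchId")

val sq = countByTime.writeStream
  .outputMode("update")
  .option("checkpointLocation", "/tmp/foreachBatch-checkpoint") // made-up location
  .foreachBatch(writeBatch)
  .start()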

DataStreamWriter API defines two new concepts (that are not available in the "base" Spark

SQL):

OutputMode that you specify using DataStreamWriter.outputMode method

Trigger that you specify using DataStreamWriter.trigger method

You may also want to give a streaming query a name using DataStreamWriter.queryName
method.

In the end, you use DataStreamWriter.start method to create and immediately start a
StreamingQuery.

import org.apache.spark.sql.streaming.OutputMode
import org.apache.spark.sql.streaming.Trigger
import scala.concurrent.duration._
val sq = writer
.format("console")
.option("truncate", false)
.option("checkpointLocation", "/tmp/csv-to-csv-checkpoint")
.outputMode(OutputMode.Append)
.trigger(Trigger.ProcessingTime(30.seconds))
.queryName("csv-to-csv")
.start("/tmp")

import org.apache.spark.sql.streaming.StreamingQuery
assert(sq.isInstanceOf[StreamingQuery])

When DataStreamWriter is requested to start a streaming query, it allows for the following
data source formats:

memory with MemorySinkV2 (with ContinuousTrigger) or MemorySink

foreach with ForeachWriterProvider sink


foreachBatch with ForeachBatchSink sink (that does not support ContinuousTrigger)

Any DataSourceRegister data source

Custom data sources specified by their fully-qualified class names or


[name].DefaultSource

avro, kafka and some others (see DataSource.lookupDataSource object method)

StreamWriteSupport

DataSource is requested to create a streaming sink that accepts StreamSinkProvider or

FileFormat data sources only

With a streaming sink, DataStreamWriter requests the StreamingQueryManager to start a


streaming query.

StreamingQuery
When a stream processing pipeline is started (using DataStreamWriter.start method),
DataStreamWriter creates a StreamingQuery and requests the StreamingQueryManager to

start a streaming query.

StreamingQueryManager
StreamingQueryManager is used to manage streaming queries.
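As a quick, hedged illustration (the printed names depend on what you started), the StreamingQueryManager is available as spark.streams:

import org.apache.spark.sql.streaming.StreamingQueryManager

val manager: StreamingQueryManager = spark.streams

// list the streaming queries currently running in this SparkSession
manager.active.foreach(q => println(s"name=${q.name} id=${q.id}"))

// block the current thread until any of the running queries terminates
manager.awaitAnyTermination()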


Streaming Join
In Spark Structured Streaming, a streaming join is a streaming query that was described (built) using the high-level streaming operators:

Dataset.crossJoin

Dataset.join

Dataset.joinWith

SQL’s JOIN clause

Streaming joins can be stateless or stateful:

Joins of a streaming query and a batch query (stream-static joins) are stateless and no
state management is required

Joins of two streaming queries (stream-stream joins) are stateful and require streaming
state (with an optional join state watermark for state removal).
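A minimal sketch of mine of the two kinds of joins (the file path and column names are made up):

// stream-static join (stateless): a streaming Dataset joined with a batch Dataset
val events = spark.readStream.format("rate").load.withColumnRenamed("value", "userId")
val users  = spark.read.format("csv").option("header", "true").load("data/users") // batch side with a userId column (assumption)
val enriched = events.join(users, Seq("userId"))

// stream-stream join (stateful): both sides are streaming Datasets
val left  = spark.readStream.format("rate").load.withColumnRenamed("value", "id")
val right = spark.readStream.format("rate").load.withColumnRenamed("value", "id")
val joined = left.join(right, Seq("id"))

assert(enriched.isStreaming && joined.isStreaming)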

Stream-Stream Joins
Spark Structured Streaming supports stream-stream joins with the following:

Equality predicate (i.e. equi-joins that use only equality comparisons in the join
predicate)

Inner , LeftOuter , and RightOuter join types only

Stream-stream equi-joins are planned as StreamingSymmetricHashJoinExec physical operators with two ShuffleExchangeExec physical operators as children (per the required partition requirements).

Join State Watermark for State Removal


Stream-stream joins may optionally define Join State Watermark for state removal (cf.
Watermark Predicates for State Removal).

A join state watermark can be specified on the following:

1. Join keys (key state)

2. Columns of the left and right sides (value state)


A join state watermark can be specified on key state, value state or both.
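For illustration, a hedged sketch of mine (column names and thresholds are made up) of a stream-stream join with watermarks on both sides and a time-range join condition, which is what gives Spark the information to compute join state watermarks and drop old state:

import org.apache.spark.sql.functions.expr

val impressions = spark.readStream.format("rate").load
  .withColumnRenamed("timestamp", "impressionTime")
  .withColumnRenamed("value", "adId")
  .withWatermark("impressionTime", "10 seconds")

val clicks = spark.readStream.format("rate").load
  .withColumnRenamed("timestamp", "clickTime")
  .withColumnRenamed("value", "clickAdId")
  .withWatermark("clickTime", "20 seconds")

val joined = impressions.join(
  clicks,
  expr("adId = clickAdId AND clickTime BETWEEN impressionTime AND impressionTime + interval 1 minute"))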

IncrementalExecution — QueryExecution of Streaming Queries
Under the covers, the high-level operators create a logical query plan with one or more
Join logical operators.

Tip Read up on Join Logical Operator in The Internals of Spark SQL online book.

In Spark Structured Streaming, IncrementalExecution is responsible for planning streaming queries for execution.

At query planning, IncrementalExecution uses the StreamingJoinStrategy execution planning strategy for planning stream-stream joins as StreamingSymmetricHashJoinExec physical operators.

Demos
Use the following demo application to learn more:

StreamStreamJoinApp

Further Reading Or Watching


Stream-stream Joins in the official documentation of Apache Spark for Structured
Streaming

Introducing Stream-Stream Joins in Apache Spark 2.3 by Databricks

(video) Deep Dive into Stateful Stream Processing in Structured Streaming by


Tathagata Das


StateStoreAwareZipPartitionsRDD
StateStoreAwareZipPartitionsRDD is a ZippedPartitionsRDD2 with the left and right parent

RDDs.

StateStoreAwareZipPartitionsRDD is created exclusively when StreamingSymmetricHashJoinExec physical operator is requested to execute and generate a recipe for a distributed computation (as an RDD[InternalRow]) (and requests StateStoreAwareZipPartitionsHelper for one).

Creating StateStoreAwareZipPartitionsRDD Instance


StateStoreAwareZipPartitionsRDD takes the following to be created:

SparkContext

Function ( (Iterator[A], Iterator[B]) ⇒ Iterator[V] , e.g. processPartitions)

Left RDD - the RDD of the left side of a join ( RDD[A] )

Right RDD - the RDD of the right side of a join ( RDD[B] )

StatefulOperatorStateInfo

Names of the state stores

StateStoreCoordinatorRef

Placement Preferences of Partition (Preferred Locations) — getPreferredLocations Method

getPreferredLocations(partition: Partition): Seq[String]

Note: getPreferredLocations is a part of the RDD Contract to specify placement preferences (aka preferred task locations), i.e. where tasks should be executed to be as close to the data as possible.

getPreferredLocations simply requests the StateStoreCoordinatorRef for the location of

every state store (with the StatefulOperatorStateInfo and the partition ID) and returns unique
executor IDs (so that processing a partition happens on the executor with the proper state
store for the operator and the partition).


SymmetricHashJoinStateManager
SymmetricHashJoinStateManager is created for the left and right OneSideHashJoiners of a StreamingSymmetricHashJoinExec physical operator (one for each side when StreamingSymmetricHashJoinExec is requested to process partitions of the left and right sides of a stream-stream join).

Figure 1. SymmetricHashJoinStateManager and Stream-Stream Join


SymmetricHashJoinStateManager manages join state using the KeyToNumValuesStore and

the KeyWithIndexToValueStore state store handlers (and simply acts like their facade).

Creating SymmetricHashJoinStateManager Instance


SymmetricHashJoinStateManager takes the following to be created:

JoinSide

Attributes of input values

Join keys ( Seq[Expression] )


StatefulOperatorStateInfo

StateStoreConf

Hadoop Configuration

SymmetricHashJoinStateManager initializes the internal properties.

KeyToNumValuesStore and KeyWithIndexToValueStore State Store Handlers — keyToNumValues and keyWithIndexToValue Internal Properties
SymmetricHashJoinStateManager uses a KeyToNumValuesStore ( keyToNumValues ) and a

KeyWithIndexToValueStore ( keyWithIndexToValue ) internally that are created immediately


when SymmetricHashJoinStateManager is created (for a OneSideHashJoiner).

keyToNumValues and keyWithIndexToValue are used when SymmetricHashJoinStateManager is

requested for the following:

Retrieving the value rows by key

Append a new value row to a given key

removeByKeyCondition

removeByValueCondition

Commit state changes

Abort state changes

Performance metrics

Join Side Marker —  JoinSide Internal Enum


JoinSide can be one of the two possible values:

LeftSide (alias: left )

RightSide (alias: right )

They are both used exclusively when StreamingSymmetricHashJoinExec binary physical


operator is requested to execute (and process partitions of the left and right sides of a
stream-stream join with an OneSideHashJoiner).

Performance Metrics —  metrics Method


metrics: StateStoreMetrics

metrics returns the combined StateStoreMetrics of the KeyToNumValuesStore and the

KeyWithIndexToValueStore state store handlers.

Note: metrics is used exclusively when OneSideHashJoiner is requested to commitStateAndGetMetrics.

removeByKeyCondition Method

removeByKeyCondition(
removalCondition: UnsafeRow => Boolean): Iterator[UnsafeRowPair]

removeByKeyCondition creates an Iterator of UnsafeRowPairs that removes keys (and

associated values) for which the given removalCondition predicate holds.

removeByKeyCondition uses the KeyToNumValuesStore for all state keys and values (in the

underlying state store).

Note: removeByKeyCondition is used exclusively when OneSideHashJoiner is requested to remove an old state (for JoinStateKeyWatermarkPredicate).

getNext Internal Method (of removeByKeyCondition Method)

getNext(): UnsafeRowPair

getNext goes over the keys and values in the allKeyToNumValues sequence and removes

keys (from the KeyToNumValuesStore) and the corresponding values (from the
KeyWithIndexToValueStore) for which the given removalCondition predicate holds.

removeByValueCondition Method

removeByValueCondition(
removalCondition: UnsafeRow => Boolean): Iterator[UnsafeRowPair]

removeByValueCondition creates an Iterator of UnsafeRowPairs that removes values (and

associated keys if needed) for which the given removalCondition predicate holds.


Note: removeByValueCondition is used exclusively when OneSideHashJoiner is requested to remove an old state (when JoinStateValueWatermarkPredicate is used).

getNext Internal Method (of removeByValueCondition Method)

getNext(): UnsafeRowPair

getNext …​FIXME

Appending New Value Row to Key —  append Method

append(
key: UnsafeRow,
value: UnsafeRow): Unit

append requests the KeyToNumValuesStore for the number of value rows for the given key.

In the end, append requests the stores for the following:

KeyWithIndexToValueStore to store the given value row

KeyToNumValuesStore to store the given key with the number of value rows
incremented.

Note: append is used exclusively when OneSideHashJoiner is requested to storeAndJoinWithOtherSide.

Retrieving Value Rows By Key —  get Method

get(key: UnsafeRow): Iterator[UnsafeRow]

get requests the KeyToNumValuesStore for the number of value rows for the given key.

In the end, get requests the KeyWithIndexToValueStore to retrieve that number of value
rows for the given key and leaves value rows only.

Note: get is used when OneSideHashJoiner is requested to storeAndJoinWithOtherSide and retrieving value rows for a key.


Committing State (Changes) —  commit Method

commit(): Unit

commit simply requests the keyToNumValues and keyWithIndexToValue state store

handlers to commit state changes.

Note: commit is used exclusively when OneSideHashJoiner is requested to commit state changes and get performance metrics.

Aborting State (Changes) —  abortIfNeeded Method

abortIfNeeded(): Unit

abortIfNeeded …​FIXME

Note abortIfNeeded is used when…​FIXME

allStateStoreNames Object Method

allStateStoreNames(joinSides: JoinSide*): Seq[String]

allStateStoreNames simply returns the names of the state stores for all possible

combinations of the given JoinSides and the two possible store types (e.g.
keyToNumValues and keyWithIndexToValue).

Note: allStateStoreNames is used exclusively when StreamingSymmetricHashJoinExec physical operator is requested to execute and generate the runtime representation (as a RDD[InternalRow]).

getStateStoreName Object Method

getStateStoreName(
joinSide: JoinSide,
storeType: StateStoreType): String

getStateStoreName simply returns a string of the following format:

[joinSide]-[storeType]
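The format above can be read as a simple string concatenation. A hypothetical re-implementation of mine (not the actual code) for illustration:

// illustrative only; mirrors the [joinSide]-[storeType] format described above
def stateStoreName(joinSide: String, storeType: String): String =
  s"$joinSide-$storeType"

assert(stateStoreName("left", "keyToNumValues") == "left-keyToNumValues")
assert(stateStoreName("right", "keyWithIndexToValue") == "right-keyWithIndexToValue")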


Note: getStateStoreName is used when:

StateStoreHandler is requested to load a state store

SymmetricHashJoinStateManager utility is requested for allStateStoreNames (for StreamingSymmetricHashJoinExec physical operator to execute and generate the runtime representation)

updateNumValueForCurrentKey Internal Method

updateNumValueForCurrentKey(): Unit

updateNumValueForCurrentKey …​FIXME

Note: updateNumValueForCurrentKey is used exclusively when SymmetricHashJoinStateManager is requested to removeByValueCondition.

Internal Properties

keyAttributes
Key attributes, i.e. AttributeReferences of the key schema. Used exclusively in KeyWithIndexToValueStore when requested for the keyWithIndexExprs, indexOrdinalInKeyWithIndexRow, keyWithIndexRowGenerator and keyRowGenerator.

keySchema
Key schema (StructType) based on the join keys with the names in the format of field and their ordinals (index). Used when:

SymmetricHashJoinStateManager is requested for the key attributes (for KeyWithIndexToValueStore)

KeyToNumValuesStore is requested for the state store

KeyWithIndexToValueStore is requested for the keyWithIndexSchema (for the internal state store)


StateStoreHandler Internal Contract


StateStoreHandler is the internal base of state store handlers that manage a StateStore

(i.e. commit, abortIfNeeded and metrics).

StateStoreHandler takes a single StateStoreType to be created:

KeyToNumValuesType for KeyToNumValuesStore (alias: keyToNumValues )

KeyWithIndexToValueType for KeyWithIndexToValueStore (alias: keyWithIndexToValue )

Note: StateStoreHandler is a Scala private abstract class and cannot be created directly. It is created indirectly for the concrete StateStoreHandlers.

Table 1. StateStoreHandler Contract

stateStore: StateStore
The StateStore

Table 2. StateStoreHandlers

KeyToNumValuesStore
StateStoreHandler of KeyToNumValuesType

KeyWithIndexToValueStore
StateStoreHandler of KeyWithIndexToValueType

Tip: Enable ALL logging levels for org.apache.spark.sql.execution.streaming.state.SymmetricHashJoinStateManager.StateStoreHandler logger to see what happens inside.

Add the following line to conf/log4j.properties:

log4j.logger.org.apache.spark.sql.execution.streaming.state.SymmetricHashJoinStateManager.StateStoreHandler=ALL

Refer to Logging.

Performance Metrics —  metrics Method

metrics: StateStoreMetrics


metrics simply requests the StateStore for the StateStoreMetrics.

Note: metrics is used exclusively when SymmetricHashJoinStateManager is requested for the metrics.

Committing State (Changes to State Store) — commit Method

commit(): Unit

commit …​FIXME

Note commit is used when…​FIXME

abortIfNeeded Method

abortIfNeeded(): Unit

abortIfNeeded …​FIXME

Note abortIfNeeded is used when…​FIXME

Loading State Store (By Key and Value Schemas) — getStateStore Method

getStateStore(
keySchema: StructType,
valueSchema: StructType): StateStore

getStateStore creates a new StateStoreProviderId (for the StatefulOperatorStateInfo of the

owning SymmetricHashJoinStateManager , the partition ID from the execution context, and the
name of the state store for the JoinSide and StateStoreType).

getStateStore uses the StateStore utility to look up a StateStore for the

StateStoreProviderId.

In the end, getStateStore prints out the following INFO message to the logs:

Loaded store [storeId]


Note: getStateStore is used when KeyToNumValuesStore and KeyWithIndexToValueStore state store handlers are created (for SymmetricHashJoinStateManager).

StateStoreType Contract (Sealed Trait)


StateStoreType is required to create a StateStoreHandler.

Table 3. StateStoreTypes

KeyToNumValuesType (toString: keyToNumValues)

KeyWithIndexToValueType (toString: keyWithIndexToValue)

Note: StateStoreType is a Scala private sealed trait, which means that all the implementations are in the same compilation unit (a single file).


KeyToNumValuesStore — State Store (Handler) Of Join Keys And Counts
KeyToNumValuesStore is a StateStoreHandler (of KeyToNumValuesType) for

SymmetricHashJoinStateManager to manage a join state.

Figure 1. KeyToNumValuesStore, KeyWithIndexToValueStore and Stream-Stream Join


As a StateStoreHandler, KeyToNumValuesStore manages a state store (that is loaded) with
the join keys (per key schema) and their count (per value schema).

KeyToNumValuesStore uses the schema for values in the state store with one field value (of

type long ) that is the number of value rows (count).


Tip: Enable ALL logging level for org.apache.spark.sql.execution.streaming.state.SymmetricHashJoinStateManager$KeyToNumValuesStore logger to see what happens inside.

Add the following line to conf/log4j.properties:

log4j.logger.org.apache.spark.sql.execution.streaming.state.SymmetricHashJoinStateManager$KeyToNumValuesStore=ALL

Refer to Logging.

Looking Up Number Of Value Rows For Given Key (Value Count) — get Method

get(key: UnsafeRow): Long

get requests the StateStore for the value for the given key and returns the long value at

0 th position (of the row found) or 0 .

Note: get is used when SymmetricHashJoinStateManager is requested for the values for a given key and append a new value to a given key.

Storing Key Count For Given Key —  put Method

put(
key: UnsafeRow,
numValues: Long): Unit

put stores the numValues at the 0 th position (of the internal unsafe row) and requests

the StateStore to store it with the given key.

put requires that the numValues count is greater than 0 (or throws an

IllegalArgumentException ).

Note: put is used when SymmetricHashJoinStateManager is requested to append a new value to a given key and updateNumValueForCurrentKey.

All State Keys and Values —  iterator Method

iterator: Iterator[KeyAndNumValues]


iterator simply requests the StateStore for all state keys and values.

Note: iterator is used when SymmetricHashJoinStateManager is requested to removeByKeyCondition and removeByValueCondition.

Removing State Key —  remove Method

remove(key: UnsafeRow): Unit

remove simply requests the StateStore to remove the given key.

Note remove is used when…​FIXME


KeyWithIndexToValueStore — State Store (Handler) Of Join Keys With Index Of Values
KeyWithIndexToValueStore is a StateStoreHandler (of KeyWithIndexToValueType) for

SymmetricHashJoinStateManager to manage a join state.

Figure 1. KeyToNumValuesStore, KeyWithIndexToValueStore and Stream-Stream Join


As a StateStoreHandler, KeyWithIndexToValueStore manages a state store (that is loaded)
for keys and values per the keys with index and input values schemas, respectively.

KeyWithIndexToValueStore uses a schema (for the state store) that is the key schema (of the

parent SymmetricHashJoinStateManager ) with an extra field index of type long .


Tip: Enable ALL logging level for org.apache.spark.sql.execution.streaming.state.SymmetricHashJoinStateManager$KeyWithIndexToValueStore logger to see what happens inside.

Add the following line to conf/log4j.properties:

log4j.logger.org.apache.spark.sql.execution.streaming.state.SymmetricHashJoinStateManager$KeyWithIndexToValueStore=ALL

Refer to Logging.

Looking Up State Row For Given Key and Index — get Method

get(
key: UnsafeRow,
valueIndex: Long): UnsafeRow

get simply requests the internal state store to look up the value for the given key and

valueIndex.

Note: get is used exclusively when SymmetricHashJoinStateManager is requested to removeByValueCondition.

Retrieving (Given Number of) Values for Key — getAll Method

getAll(
key: UnsafeRow,
numValues: Long): Iterator[KeyWithIndexAndValue]

getAll …​FIXME

Note: getAll is used when SymmetricHashJoinStateManager is requested to get values for a given key and removeByKeyCondition.

Storing State Row For Given Key and Index — put Method


put(
key: UnsafeRow,
valueIndex: Long,
value: UnsafeRow): Unit

put …​FIXME

Note: put is used when SymmetricHashJoinStateManager is requested to append a new value to a given key and removeByKeyCondition.

remove Method

remove(
key: UnsafeRow,
valueIndex: Long): Unit

remove …​FIXME

Note: remove is used when SymmetricHashJoinStateManager is requested to removeByKeyCondition and removeByValueCondition.

keyWithIndexRow Internal Method

keyWithIndexRow(
key: UnsafeRow,
valueIndex: Long): UnsafeRow

keyWithIndexRow uses the keyWithIndexRowGenerator to generate an UnsafeRow for the

key and sets the valueIndex at the indexOrdinalInKeyWithIndexRow position.

Note: keyWithIndexRow is used when KeyWithIndexToValueStore is requested to get, getAll, put, remove and removeAllValues.

removeAllValues Method

removeAllValues(
key: UnsafeRow,
numValues: Long): Unit

removeAllValues …​FIXME


Note removeAllValues does not seem to be used at all.

iterator Method

iterator: Iterator[KeyWithIndexAndValue]

iterator …​FIXME

Note iterator does not seem to be used at all.

Internal Properties

indexOrdinalInKeyWithIndexRow
Position of the index in the key row (which corresponds to the number of the key attributes). Used exclusively in the keyWithIndexRow.

keyWithIndexExprs
keyAttributes with Literal(1L) expression appended. Used exclusively for the keyWithIndexRowGenerator projection.

keyWithIndexRowGenerator
UnsafeProjection for the keyWithIndexExprs bound to the keyAttributes. Used exclusively in keyWithIndexRow.


OneSideHashJoiner
OneSideHashJoiner manages join state of one side of a stream-stream join (using

SymmetricHashJoinStateManager).

OneSideHashJoiner is created exclusively for StreamingSymmetricHashJoinExec physical

operator (when requested to process partitions of the left and right sides of a stream-stream
join).

Figure 1. OneSideHashJoiner and StreamingSymmetricHashJoinExec


StreamingSymmetricHashJoinExec physical operator uses two OneSideHashJoiners per side of

the stream-stream join (left and right sides).

OneSideHashJoiner uses an optional join state watermark predicate to remove old state.

Note: OneSideHashJoiner is a Scala private internal class of StreamingSymmetricHashJoinExec and so has full access to StreamingSymmetricHashJoinExec properties.

Creating OneSideHashJoiner Instance


OneSideHashJoiner takes the following to be created:

JoinSide

Input attributes ( Seq[Attribute] )

Join keys ( Seq[Expression] )

Input rows ( Iterator[InternalRow] )

Optional pre-join filter Catalyst expression

Post-join filter ( (InternalRow) ⇒ Boolean )


JoinStateWatermarkPredicate

OneSideHashJoiner initializes the internal registries and counters.

SymmetricHashJoinStateManager — joinStateManager Internal Property

joinStateManager: SymmetricHashJoinStateManager

joinStateManager is a SymmetricHashJoinStateManager that is created for a

OneSideHashJoiner (with the join side, the input attributes, the join keys, and the

StatefulOperatorStateInfo of the owning StreamingSymmetricHashJoinExec).

joinStateManager is used when OneSideHashJoiner is requested for the following:

storeAndJoinWithOtherSide

Get the values for a given key

Remove an old state

commitStateAndGetMetrics

Number of Updated State Rows — updatedStateRowsCount Internal Counter
updatedStateRowsCount is the number of join keys and associated rows that were persisted as a join state, i.e. how many times storeAndJoinWithOtherSide requested the SymmetricHashJoinStateManager to append the join key and the input row (to a join state).

updatedStateRowsCount is then used (via numUpdatedStateRows method) for the

numUpdatedStateRows performance metric.

updatedStateRowsCount is available via numUpdatedStateRows method.

numUpdatedStateRows: Long

Note: numUpdatedStateRows is used exclusively when StreamingSymmetricHashJoinExec physical operator is requested to process partitions of the left and right sides of a stream-stream join (and completes).


Optional Join State Watermark Predicate — stateWatermarkPredicate Internal Property

stateWatermarkPredicate: Option[JoinStateWatermarkPredicate]

When created, OneSideHashJoiner is given a JoinStateWatermarkPredicate.

stateWatermarkPredicate is used for the stateKeyWatermarkPredicateFunc (when a

JoinStateKeyWatermarkPredicate) and the stateValueWatermarkPredicateFunc (when a


JoinStateValueWatermarkPredicate) that are both used when OneSideHashJoiner is
requested to removeOldState.

storeAndJoinWithOtherSide Method

storeAndJoinWithOtherSide(
otherSideJoiner: OneSideHashJoiner)(
generateJoinedRow: (InternalRow, InternalRow) => JoinedRow): Iterator[InternalRow]

storeAndJoinWithOtherSide tries to find the watermark attribute among the input attributes.

storeAndJoinWithOtherSide creates a watermark expression (for the watermark attribute and

the current event-time watermark).

With the watermark attribute found, storeAndJoinWithOtherSide generates a new predicate


for the watermark expression and the input attributes that is then used to filter out (exclude)
late rows from the input. Otherwise, the input rows are left unchanged (i.e. no rows are
considered late and excluded).

For every input row (possibly watermarked), storeAndJoinWithOtherSide applies the


preJoinFilter predicate and branches off per result (true or false).

Note: storeAndJoinWithOtherSide is used when StreamingSymmetricHashJoinExec physical operator is requested to process partitions of the left and right sides of a stream-stream join.

preJoinFilter Predicate Positive ( true )

When the preJoinFilter predicate succeeds on an input row, storeAndJoinWithOtherSide


extracts the join key (using the keyGenerator) and requests the given OneSideHashJoiner
( otherSideJoiner ) for the SymmetricHashJoinStateManager that is in turn requested for the
state values for the extracted join key. The values are then processed (mapped over) using
the given generateJoinedRow function and then filtered by the post-join filter.


storeAndJoinWithOtherSide uses the stateKeyWatermarkPredicateFunc (on the extracted

join key) and the stateValueWatermarkPredicateFunc (on the current input row) to determine
whether to request the SymmetricHashJoinStateManager to append the key and the input
row (to a join state). If so, storeAndJoinWithOtherSide increments the
updatedStateRowsCount counter.

preJoinFilter Predicate Negative ( false )

When the preJoinFilter predicate fails on an input row, storeAndJoinWithOtherSide creates a


new Iterator[InternalRow] of joined rows per join side and type:

For LeftSide and LeftOuter , the join row is the current row with the values of the right
side all null ( nullRight )

For RightSide and RightOuter , the join row is the current row with the values of the left
side all null ( nullLeft )

For all other combinations, the iterator is simply empty (that will be removed from the
output by the outer nonLateRows.flatMap).

Removing Old State —  removeOldState Method

removeOldState(): Iterator[UnsafeRowPair]

removeOldState branches off per the JoinStateWatermarkPredicate:

For JoinStateKeyWatermarkPredicate, removeOldState requests the


SymmetricHashJoinStateManager to removeByKeyCondition (with the
stateKeyWatermarkPredicateFunc)

For JoinStateValueWatermarkPredicate, removeOldState requests the


SymmetricHashJoinStateManager to removeByValueCondition (with the
stateValueWatermarkPredicateFunc)

For any other predicates, removeOldState returns an empty iterator (no rows to
process)

Note: removeOldState is used exclusively when StreamingSymmetricHashJoinExec physical operator is requested to process partitions of the left and right sides of a stream-stream join.

Retrieving Value Rows For Key —  get Method


get(key: UnsafeRow): Iterator[UnsafeRow]

get simply requests the SymmetricHashJoinStateManager to retrieve value rows for the

key.

Note: get is used exclusively when StreamingSymmetricHashJoinExec physical operator is requested to process partitions of the left and right sides of a stream-stream join.

Committing State (Changes) and Requesting Performance Metrics — commitStateAndGetMetrics Method

commitStateAndGetMetrics(): StateStoreMetrics

commitStateAndGetMetrics simply requests the SymmetricHashJoinStateManager to commit

followed by requesting for the performance metrics.

Note: commitStateAndGetMetrics is used exclusively when StreamingSymmetricHashJoinExec physical operator is requested to process partitions of the left and right sides of a stream-stream join.

Internal Properties


keyGenerator: UnsafeProjection
Function to project (extract) join keys from an input row. Used when…​FIXME

preJoinFilter: InternalRow => Boolean
Used when…​FIXME

stateKeyWatermarkPredicateFunc: InternalRow => Boolean
Predicate for late rows based on the stateWatermarkPredicate. Used for the following:

storeAndJoinWithOtherSide (to check whether to append a row to the SymmetricHashJoinStateManager)

removeOldState

stateValueWatermarkPredicateFunc: InternalRow => Boolean
Predicate for late rows based on the stateWatermarkPredicate. Used for the following:

storeAndJoinWithOtherSide (to check whether to append a row to the SymmetricHashJoinStateManager)

removeOldState


JoinStateWatermarkPredicates — Watermark Predicates for State Removal
JoinStateWatermarkPredicates contains watermark predicates for state removal of the

children of a StreamingSymmetricHashJoinExec physical operator:

JoinStateWatermarkPredicate for the left-hand side of a join (default: None )

JoinStateWatermarkPredicate for the right-hand side of a join (default: None )

JoinStateWatermarkPredicates is created for the following:

StreamingSymmetricHashJoinExec physical operator is created (with the optional


properties undefined, including JoinStateWatermarkPredicates)

StreamingSymmetricHashJoinHelper utility is requested for one (for IncrementalExecution

for the state preparation rule to optimize and specify the execution-specific configuration
for a query plan with StreamingSymmetricHashJoinExec physical operators)

Textual Representation —  toString Method

toString: String

Note: toString is part of the java.lang.Object contract for the string representation of
the object.

toString uses the left and right predicates for the string representation:

state cleanup [ left [left], right [right] ]


JoinStateWatermarkPredicate Contract (Sealed Trait)

JoinStateWatermarkPredicate is the abstraction of join state watermark predicates that are
described by a Catalyst expression and a desc (description).

JoinStateWatermarkPredicate is created using StreamingSymmetricHashJoinHelper utility

(for planning a StreamingSymmetricHashJoinExec physical operator for execution with


execution-specific configuration)

JoinStateWatermarkPredicate is used to create a OneSideHashJoiner (and

JoinStateWatermarkPredicates).

Table 1. JoinStateWatermarkPredicate Contract

desc

  desc: String

  Used exclusively for the textual representation

expr

  expr: Expression

  A Catalyst Expression

  Used for the textual representation and a JoinStateWatermarkPredicates (for
  StreamingSymmetricHashJoinExec physical operator)


Table 2. JoinStateWatermarkPredicates

JoinStateKeyWatermarkPredicate

  Watermark predicate on state keys (i.e. when the streaming watermark is defined either
  on the left or right join keys)

  Created when StreamingSymmetricHashJoinHelper utility is requested for a
  JoinStateWatermarkPredicates for the left and right side of a stream-stream join (when
  IncrementalExecution is requested to optimize a query plan with a
  StreamingSymmetricHashJoinExec physical operator)

  Used when OneSideHashJoiner is requested for the stateKeyWatermarkPredicateFunc and
  then to remove an old state

JoinStateValueWatermarkPredicate

  Watermark predicate on state values

Note: JoinStateWatermarkPredicate is a Scala sealed trait, which means that all the
implementations are in the same compilation unit (a single file).

Textual Representation —  toString Method

toString: String

Note: toString is part of the java.lang.Object contract for the string representation of
the object.

toString uses the desc and expr for the string representation:

[desc]: [expr]
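
For illustration, the hierarchy can be sketched as follows (a simplified sketch; the exact desc strings are an assumption and the real definitions live in the Spark sources):

import org.apache.spark.sql.catalyst.expressions.Expression

// Simplified sketch of the sealed-trait hierarchy (not the exact Spark sources)
sealed trait JoinStateWatermarkPredicate {
  def expr: Expression
  def desc: String
  override def toString: String = s"$desc: $expr"
}

case class JoinStateKeyWatermarkPredicate(expr: Expression)
  extends JoinStateWatermarkPredicate {
  override def desc: String = "key predicate" // assumed desc value
}

case class JoinStateValueWatermarkPredicate(expr: Expression)
  extends JoinStateWatermarkPredicate {
  override def desc: String = "value predicate" // assumed desc value
}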


StateStoreAwareZipPartitionsHelper — 
Extension Methods for Creating
StateStoreAwareZipPartitionsRDD
StateStoreAwareZipPartitionsHelper is a Scala implicit class of a data RDD (of type

RDD[T] ) to create a StateStoreAwareZipPartitionsRDD for

StreamingSymmetricHashJoinExec physical operator.

Note: Implicit classes are a language feature in Scala for implicit conversions that add
extension methods to existing types.
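
For illustration, a minimal implicit class with a single extension method could look as follows (a generic, standalone example unrelated to the join internals; the describe method is made up for the demo):

import org.apache.spark.rdd.RDD

object Implicits {
  // adds a describe extension method to any RDD[T] once Implicits._ is imported
  implicit class RichRDD[T](val rdd: RDD[T]) extends AnyVal {
    def describe: String = s"RDD with ${rdd.getNumPartitions} partition(s)"
  }
}

// Usage (e.g. in spark-shell):
//   import Implicits._
//   spark.sparkContext.range(0, 10).describe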

Creating StateStoreAwareZipPartitionsRDD 
—  stateStoreAwareZipPartitions Method

stateStoreAwareZipPartitions[U: ClassTag, V: ClassTag](


dataRDD2: RDD[U],
stateInfo: StatefulOperatorStateInfo,
storeNames: Seq[String],
storeCoordinator: StateStoreCoordinatorRef
)(f: (Iterator[T], Iterator[U]) => Iterator[V]): RDD[V]

stateStoreAwareZipPartitions simply creates a new StateStoreAwareZipPartitionsRDD.

Note: stateStoreAwareZipPartitions is used exclusively when the
StreamingSymmetricHashJoinExec physical operator is requested to execute and generate a
recipe for a distributed computation (as an RDD[InternalRow]).


StreamingSymmetricHashJoinHelper Utility
StreamingSymmetricHashJoinHelper is a Scala object with the following utility methods:

getStateWatermarkPredicates

Creating JoinStateWatermarkPredicates 
—  getStateWatermarkPredicates Object Method

getStateWatermarkPredicates(
leftAttributes: Seq[Attribute],
rightAttributes: Seq[Attribute],
leftKeys: Seq[Expression],
rightKeys: Seq[Expression],
condition: Option[Expression],
eventTimeWatermark: Option[Long]): JoinStateWatermarkPredicates

getStateWatermarkPredicates tries to find the index of the watermark attribute among the
left keys first and, if not found, among the right keys.

Note: The watermark attribute is defined using the Dataset.withWatermark operator.
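
For example, in the following stream-stream join both sides define a watermark on the join key (timestamp of the built-in rate source), so the watermark attribute is found among the join keys (a key-based state watermark predicate, i.e. JoinStateKeyWatermarkPredicate):

// The watermark column ("timestamp") is also the join key
val left = spark.readStream.format("rate").load.withWatermark("timestamp", "10 seconds")
val right = spark.readStream.format("rate").load.withWatermark("timestamp", "20 seconds")
val joined = left.join(right, Seq("timestamp"))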

getStateWatermarkPredicates determines the state watermark predicate for the left side of a

join (for the given leftAttributes , the leftKeys and the rightAttributes ).

getStateWatermarkPredicates determines the state watermark predicate for the right side of

a join (for the given rightAttributes , the rightKeys and the leftAttributes ).

In the end, getStateWatermarkPredicates creates a JoinStateWatermarkPredicates with the


left- and right-side state watermark predicates.

Note: getStateWatermarkPredicates is used exclusively when IncrementalExecution is
requested to apply the state preparation rule for batch-specific configuration (while
optimizing query plans with StreamingSymmetricHashJoinExec physical operators).

Join State Watermark Predicate (for One Side of Join)
—  getOneSideStateWatermarkPredicate Internal Method


getOneSideStateWatermarkPredicate(
oneSideInputAttributes: Seq[Attribute],
oneSideJoinKeys: Seq[Expression],
otherSideInputAttributes: Seq[Attribute]): Option[JoinStateWatermarkPredicate]

getOneSideStateWatermarkPredicate finds the attribute that was used to define the
watermark (among the oneSideInputAttributes or the join keys) and creates a
JoinStateWatermarkPredicate as follows (see the example after this section):

JoinStateKeyWatermarkPredicate if the watermark was defined on a join key (with the
watermark expression for the index of the join key expression)

JoinStateValueWatermarkPredicate if the watermark was defined among the
oneSideInputAttributes (with the state value watermark based on the given
oneSideInputAttributes and otherSideInputAttributes)

Note: getOneSideStateWatermarkPredicate creates no JoinStateWatermarkPredicate (i.e.
None) when no watermark is found.

Note: getOneSideStateWatermarkPredicate is used exclusively to create a
JoinStateWatermarkPredicates.
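
To complement the key-based example earlier, the value-based case shows up when the watermark is defined on a column that is not a join key and the join condition constrains the event times of both sides (the column names below are illustrative only):

// Illustrative column names only: the watermark is on "clickTime" (not a join key) and
// the time-range condition lets Spark derive a state value watermark
// (JoinStateValueWatermarkPredicate)
import org.apache.spark.sql.functions.expr
val impressions = spark.readStream.format("rate").load
  .toDF("impressionTime", "impressionId")
  .withWatermark("impressionTime", "10 seconds")
val clicks = spark.readStream.format("rate").load
  .toDF("clickTime", "clickId")
  .withWatermark("clickTime", "20 seconds")
val joined = impressions.join(
  clicks,
  expr("clickId = impressionId AND clickTime BETWEEN impressionTime AND impressionTime + interval 1 hour"))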


StreamingJoinHelper Utility
StreamingJoinHelper is a Scala object with the following utility methods:

getStateValueWatermark

Tip: Enable ALL logging level for
org.apache.spark.sql.catalyst.analysis.StreamingJoinHelper to see what happens inside.

Add the following line to conf/log4j.properties:

log4j.logger.org.apache.spark.sql.catalyst.analysis.StreamingJoinHelper=ALL

Refer to Logging.

State Value Watermark —  getStateValueWatermark Object Method

getStateValueWatermark(
attributesToFindStateWatermarkFor: AttributeSet,
attributesWithEventWatermark: AttributeSet,
joinCondition: Option[Expression],
eventWatermark: Option[Long]): Option[Long]

getStateValueWatermark …​FIXME

Note: getStateValueWatermark is used when:

UnsupportedOperationChecker utility is used to checkForStreaming

StreamingSymmetricHashJoinHelper utility is used to create a
JoinStateWatermarkPredicates


Extending Structured Streaming with New Data Sources
Spark Structured Streaming uses Spark SQL for planning streaming queries (preparing for
execution).

Spark SQL is migrating from the former Data Source API V1 to the new Data Source API V2,
and so is Structured Streaming. That is exactly the reason for the BaseStreamingSource and
BaseStreamingSink abstractions that bridge the class hierarchies of the two Data Source
APIs, for streaming sources and sinks, respectively.

Structured Streaming supports two stream execution engines (i.e. Micro-Batch and
Continuous) with their own APIs.

Micro-Batch Stream Processing supports the old Data Source API V1 and the new modern
Data Source API V2 with micro-batch-specific APIs for streaming sources and sinks.

Continuous Stream Processing supports the new modern Data Source API V2 only with
continuous-specific APIs for streaming sources and sinks.

The following are the questions to think of (and answer) while considering development of a
new data source for Structured Streaming. They are supposed to give you a sense of how
much work and time it takes as well as what Spark version to support (e.g. 2.2 vs 2.4).

Data Source API V1

Data Source API V2

Micro-Batch Stream Processing (Structured Streaming V1)

Continuous Stream Processing (Structured Streaming V2)

Read side (BaseStreamingSource)

Write side (BaseStreamingSink)
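
Whichever of the APIs a new data source targets, it is eventually used in a streaming query by its fully-qualified class name (or a short-name alias registered via META-INF/services/org.apache.spark.sql.sources.DataSourceRegister). A minimal sketch, assuming a hypothetical provider class org.example.MyStreamingSourceProvider:

// org.example.MyStreamingSourceProvider is a made-up class name; replace it with your own
val input = spark
  .readStream
  .format("org.example.MyStreamingSourceProvider")
  .option("some-option", "some-value")
  .load

val query = input
  .writeStream
  .format("console")
  .option("checkpointLocation", "/tmp/checkpoint-custom-source-demo")
  .start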


BaseStreamingSource Contract — Base of
Streaming Readers and Sources
BaseStreamingSource is the abstraction of streaming readers and sources that can be

stopped.

The main purpose of BaseStreamingSource is to share a common abstraction between the


former Data Source API V1 (Source API) and the modern Data Source API V2 (until Spark
Structured Streaming migrates to the Data Source API V2 fully).

Table 1. BaseStreamingSource Contract

stop

  void stop()

  Stops the streaming source or reader (and frees up any resources it may have allocated)

  Used when:

  StreamExecution is requested to stop streaming sources and readers

  DataStreamReader is requested to load data from a MicroBatchReadSupport data source
  (for read schema)

Table 2. BaseStreamingSources (Extensions Only)

ContinuousReader

  Data source readers in Continuous Stream Processing (based on Data Source API V2)

MemoryStreamBase

  Base implementation of ContinuousMemoryStream (for Continuous Stream Processing) and
  MemoryStream (for Micro-Batch Stream Processing)

MicroBatchReader

  Data source readers in Micro-Batch Stream Processing (based on Data Source API V2)

Source

  Streaming sources for Micro-Batch Stream Processing (based on Data Source API V1)
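
For example, a custom streaming source based on the Data Source API V1 implements the Source contract (which extends BaseStreamingSource), including stop. A minimal skeleton could look as follows (a sketch only; the offset bookkeeping and the construction of the streaming DataFrame in getBatch are deliberately left out):

import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.execution.streaming.{Offset, Source}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// A do-nothing Source to show where stop() fits in (sketch only)
class NoopSource(sqlContext: SQLContext) extends Source {
  override def schema: StructType = StructType(StructField("value", LongType) :: Nil)

  // no data ever becomes available in this sketch
  override def getOffset: Option[Offset] = None

  // a real source returns the rows with offsets in (start, end]
  override def getBatch(start: Option[Offset], end: Offset): DataFrame =
    throw new UnsupportedOperationException("left out of this sketch")

  // called when the streaming query is stopped; release any resources here
  override def stop(): Unit = ()
}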


BaseStreamingSink Contract — Base of
Streaming Writers and Sinks
BaseStreamingSink is the abstraction of streaming writers and sinks with the only purpose of

sharing a common abstraction between the former Data Source API V1 (Sink API) and the
modern Data Source API V2 (until Spark Structured Streaming migrates to the Data Source
API V2 fully).

BaseStreamingSink defines no methods.

Table 1. BaseStreamingSinks (Extensions Only)

MemorySinkBase

  Base contract for data sinks in memory data source

Sink

  Streaming sinks for Micro-Batch Stream Processing (based on Data Source API V1)

StreamWriteSupport

  Data source writers (based on Data Source API V2)


StreamWriteSupport Contract — Writable
Streaming Data Sources
StreamWriteSupport is the abstraction of DataSourceV2 sinks that create StreamWriters for

streaming write (when used in streaming queries in MicroBatchExecution and


ContinuousExecution).

StreamWriter createStreamWriter(
String queryId,
StructType schema,
OutputMode mode,
DataSourceOptions options)

createStreamWriter creates a StreamWriter for streaming write and is used when the

stream execution thread for a streaming query is started and requests the stream execution
engines to start, i.e.

ContinuousExecution is requested to runContinuous

MicroBatchExecution is requested to run a single streaming batch

Table 1. StreamWriteSupports

ConsoleSinkProvider

  Streaming sink for the console data source format

ForeachWriterProvider

KafkaSourceProvider

MemorySinkV2
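
For example, the console and kafka formats below resolve to ConsoleSinkProvider and KafkaSourceProvider, respectively, so their createStreamWriter is called for every micro-batch (or epoch) of the queries. The sketch assumes df is a streaming DataFrame; the Kafka sink additionally expects a value column and the spark-sql-kafka-0-10 package on the classpath:

// console sink (ConsoleSinkProvider)
val consoleQuery = df.writeStream
  .format("console")
  .option("checkpointLocation", "/tmp/checkpoint-console-demo")
  .start

// kafka sink (KafkaSourceProvider)
val kafkaQuery = df.writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "demo-topic")
  .option("checkpointLocation", "/tmp/checkpoint-kafka-demo")
  .start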


StreamWriter Contract
StreamWriter is the extension of the DataSourceWriter contract to support epochs, i.e.

streaming writers that can abort and commit writing jobs for a specified epoch.

Tip Read up on DataSourceWriter in The Internals of Spark SQL book.

Table 1. StreamWriter Contract

abort

  void abort(
    long epochId,
    WriterCommitMessage[] messages)

  Aborts the writing job for a specified epochId and WriterCommitMessages

  Used exclusively when MicroBatchWriter is requested to abort

commit

  void commit(
    long epochId,
    WriterCommitMessage[] messages)

  Commits the writing job for a specified epochId and WriterCommitMessages

  Used when:

  EpochCoordinator is requested to commitEpoch

  MicroBatchWriter is requested to commit

Table 2. StreamWriters
StreamWriter Description

ForeachWriterProvider foreachWriter data source

ConsoleWriter console data source

KafkaStreamWriter kafka data source

MemoryStreamWriter memory data source


DataSource — Pluggable Data Provider Framework

Tip: Read up on DataSource — Pluggable Data Provider Framework in The Internals of Spark
SQL online book.

Creating DataSource Instance


DataSource takes the following to be created:

SparkSession

className , i.e. the fully-qualified class name or an alias of the data source

Paths (default: Nil , i.e. an empty collection)

Optional user-defined schema (default: None )

Names of the partition columns (default: (empty))

Optional BucketSpec (default: None )

Configuration options (default: empty)

Optional CatalogTable (default: None )

DataSource initializes the internal properties.

Generating Metadata of Streaming Source (Data Source API V1)
—  sourceSchema Internal Method

sourceSchema(): SourceInfo

sourceSchema creates a new instance of the data source class and branches off per the
type, e.g. StreamSourceProvider, FileFormat and other types.

Note: sourceSchema is used exclusively when DataSource is requested for the SourceInfo.

StreamSourceProvider


For a StreamSourceProvider, sourceSchema requests the StreamSourceProvider for the


name and schema (of the streaming source).

In the end, sourceSchema returns the name and the schema as part of SourceInfo (with
partition columns unspecified).

FileFormat
For a FileFormat , sourceSchema …​FIXME

Other Types
For any other data source type, sourceSchema simply throws an
UnsupportedOperationException :

Data source [className] does not support streamed reading

Creating Streaming Source (Micro-Batch Stream Processing / Data Source API V1)
—  createSource Method

createSource(
  metadataPath: String): Source

createSource creates a new instance of the data source class and branches off per the
type, e.g. StreamSourceProvider, FileFormat and other types.

Note: createSource is used exclusively when MicroBatchExecution is requested to
initialize the analyzed logical plan.

StreamSourceProvider
For a StreamSourceProvider, createSource requests the StreamSourceProvider to create a
source.

FileFormat
For a FileFormat , createSource creates a new FileStreamSource.

createSource throws an IllegalArgumentException when path option was not specified for

a FileFormat data source:


'path' is not specified

Other Types
For any other data source type, createSource simply throws an
UnsupportedOperationException :

Data source [className] does not support streamed reading

Creating Streaming Sink —  createSink Method

createSink(
outputMode: OutputMode): Sink

createSink creates a streaming sink for StreamSinkProvider or FileFormat data sources.

Tip Read up on FileFormat Data Source in The Internals of Spark SQL book.

Internally, createSink creates a new instance of the providingClass and branches off per
type:

For a StreamSinkProvider, createSink simply delegates the call and requests it to


create a streaming sink

For a FileFormat , createSink creates a FileStreamSink when path option is


specified and the output mode is Append

createSink throws a IllegalArgumentException when path option is not specified for a

FileFormat data source:

'path' is not specified

createSink throws an AnalysisException when the given OutputMode is different from

Append for a FileFormat data source:

Data source [className] does not support [outputMode] output mode

createSink throws an UnsupportedOperationException for unsupported data source formats:

Data source [className] does not support streamed writing

64
DataSource

Note: createSink is used exclusively when DataStreamWriter is requested to start a
streaming query.
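
For example, a FileFormat sink like parquet requires both the path option and the Append output mode, so a streaming query writing to it could be started as follows (assuming input is a streaming Dataset; the paths are illustrative):

// FileStreamSink: path is mandatory and only the Append output mode is supported
val query = input
  .writeStream
  .format("parquet")
  .option("path", "/tmp/parquet-sink-demo")
  .option("checkpointLocation", "/tmp/checkpoint-parquet-sink-demo")
  .outputMode("append")
  .start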

Internal Properties

providingClass

  java.lang.Class for the className (that can be a fully-qualified class name or an
  alias of the data source)

sourceInfo

  sourceInfo: SourceInfo

  Metadata of a Source with the alias (short name), the schema, and optional
  partitioning columns

  sourceInfo is a lazy value and so initialized once (the very first time) when accessed.

  Used when:

  DataSource is requested to create a source (for a FileFormat data source) (when
  MicroBatchExecution is requested to initialize the analyzed logical plan)

  StreamingRelation utility is requested for a StreamingRelation (when DataStreamReader
  is requested for a streaming DataFrame)


Demos
1. Demo: Internals of FlatMapGroupsWithStateExec Physical Operator

2. Demo: Exploring Checkpointed State

3. Demo: Streaming Watermark with Aggregation in Append Output Mode

4. Demo: Streaming Query for Running Counts (Socket Source and Complete Output
Mode)

5. Demo: Streaming Aggregation with Kafka Data Source

6. Demo: groupByKey Streaming Aggregation in Update Mode

7. Demo: StateStoreSaveExec with Complete Output Mode

8. Demo: StateStoreSaveExec with Update Output Mode

9. Developing Custom Streaming Sink (and Monitoring SQL Queries in web UI)

10. current_timestamp Function For Processing Time in Streaming Queries

11. Using StreamingQueryManager for Query Termination Management


Demo: Internals of
FlatMapGroupsWithStateExec Physical
Operator
The following demo shows the internals of FlatMapGroupsWithStateExec physical operator
in a Arbitrary Stateful Streaming Aggregation.

// Reduce the number of partitions and hence the state stores


// That is supposed to make debugging state checkpointing easier
val numShufflePartitions = 1
import org.apache.spark.sql.internal.SQLConf.SHUFFLE_PARTITIONS
spark.sessionState.conf.setConf(SHUFFLE_PARTITIONS, numShufflePartitions)
assert(spark.sessionState.conf.numShufflePartitions == numShufflePartitions)

// Define event "format"


// Use :paste mode in spark-shell
import java.sql.Timestamp
case class Event(time: Timestamp, value: Long)
import scala.concurrent.duration._
object Event {
def apply(secs: Long, value: Long): Event = {
Event(new Timestamp(secs.seconds.toMillis), value)
}
}

// Using memory data source for full control of the input


import org.apache.spark.sql.execution.streaming.MemoryStream
implicit val sqlCtx = spark.sqlContext
val events = MemoryStream[Event]
val values = events.toDS
assert(values.isStreaming, "values must be a streaming Dataset")

values.printSchema
/**
root
|-- time: timestamp (nullable = true)
|-- value: long (nullable = false)
*/

import scala.concurrent.duration._
val delayThreshold = 10.seconds
val valuesWatermarked = values
  .withWatermark(eventTime = "time", delayThreshold.toString) // required for EventTimeTimeout

// Could use Long directly, but...


// Let's use case class to make the demo a bit more advanced


case class Count(value: Long)

import java.sql.Timestamp
import org.apache.spark.sql.streaming.GroupState
val keyCounts = (key: Long, values: Iterator[(Timestamp, Long)], state: GroupState[Count]) => {
  println(s""">>> keyCounts(key = $key, state = ${state.getOption.getOrElse("<empty>")})""")
  println(s">>> >>> currentProcessingTimeMs: ${state.getCurrentProcessingTimeMs}")
  println(s">>> >>> currentWatermarkMs: ${state.getCurrentWatermarkMs}")
  println(s">>> >>> hasTimedOut: ${state.hasTimedOut}")
  val count = Count(values.length)
  Iterator((key, count))
}

import org.apache.spark.sql.streaming.{GroupStateTimeout, OutputMode}


val valuesCounted = valuesWatermarked
  .as[(Timestamp, Long)] // convert DataFrame to Dataset to make groupByKey easier to write
  .groupByKey { case (time, value) => value }
  .flatMapGroupsWithState(
    OutputMode.Update,
    timeoutConf = GroupStateTimeout.EventTimeTimeout)(func = keyCounts)
  .toDF("value", "count")

valuesCounted.explain
/**
== Physical Plan ==
*(2) Project [_1#928L AS value#931L, _2#929 AS count#932]
+- *(2) SerializeFromObject [assertnotnull(input[0, scala.Tuple2, true])._1 AS _1#928L
, if (isnull(assertnotnull(input[0, scala.Tuple2, true])._2)) null else named_struct(v
alue, assertnotnull(assertnotnull(input[0, scala.Tuple2, true])._2).value) AS _2#929]
+- FlatMapGroupsWithState $line140.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$i
w$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$
$iw$$iw$$iw$$iw$$Lambda$4117/181063008@d2cdc82, value#923: bigint, newInstance(class s
cala.Tuple2), [value#923L], [time#915-T10000ms, value#916L], obj#927: scala.Tuple2, st
ate info [ checkpoint = <unknown>, runId = 9af3d00c-fe1f-46a0-8630-4e0d0af88042, opId
= 0, ver = 0, numPartitions = 1], class[value[0]: bigint], 2, Update, EventTimeTimeout
, 0, 0
+- *(1) Sort [value#923L ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(value#923L, 1)
+- AppendColumns $line140.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$i
w$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$
$iw$$iw$$iw$$iw$$Lambda$4118/2131767153@3e606b4c, newInstance(class scala.Tuple2), [in
put[0, bigint, false] AS value#923L]
+- EventTimeWatermark time#915: timestamp, interval 10 seconds
+- StreamingRelation MemoryStream[time#915,value#916L], [time#915, v
alue#916L]
*/

val queryName = "FlatMapGroupsWithStateExec_demo"


val checkpointLocation = s"/tmp/checkpoint-$queryName"


// Delete the checkpoint location from previous executions


import java.nio.file.{Files, FileSystems}
import java.util.Comparator
import scala.collection.JavaConverters._
val path = FileSystems.getDefault.getPath(checkpointLocation)
if (Files.exists(path)) {
Files.walk(path)
.sorted(Comparator.reverseOrder())
.iterator
.asScala
.foreach(p => p.toFile.delete)
}

import org.apache.spark.sql.streaming.OutputMode.Update
val streamingQuery = valuesCounted
.writeStream
.format("memory")
.queryName(queryName)
.option("checkpointLocation", checkpointLocation)
.outputMode(Update)
.start

assert(streamingQuery.status.message == "Waiting for data to arrive")

// Use web UI to monitor the metrics of the streaming query


// Go to http://localhost:4040/SQL/ and click one of the Completed Queries with Job IDs

// You may also want to check out checkpointed state


// in /tmp/checkpoint-FlatMapGroupsWithStateExec_demo/state/0/0

val batch = Seq(


Event(secs = 1, value = 1),
Event(secs = 15, value = 2))
events.addData(batch)
streamingQuery.processAllAvailable()

/**
>>> keyCounts(key = 1, state = <empty>)
>>> >>> currentProcessingTimeMs: 1561881557237
>>> >>> currentWatermarkMs: 0
>>> >>> hasTimedOut: false
>>> keyCounts(key = 2, state = <empty>)
>>> >>> currentProcessingTimeMs: 1561881557237
>>> >>> currentWatermarkMs: 0
>>> >>> hasTimedOut: false
*/

spark.table(queryName).show(truncate = false)
/**
+-----+-----+
|value|count|
+-----+-----+


|1 |[1] |
|2 |[1] |
+-----+-----+
*/

// With at least one execution we can review the execution plan


streamingQuery.explain
/**
== Physical Plan ==
*(2) Project [_1#928L AS value#931L, _2#929 AS count#932]
+- *(2) SerializeFromObject [assertnotnull(input[0, scala.Tuple2, true])._1 AS _1#928L
, if (isnull(assertnotnull(input[0, scala.Tuple2, true])._2)) null else named_struct(v
alue, assertnotnull(assertnotnull(input[0, scala.Tuple2, true])._2).value) AS _2#929]
+- FlatMapGroupsWithState $line140.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$i
w$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$
$iw$$iw$$iw$$iw$$Lambda$4117/181063008@d2cdc82, value#923: bigint, newInstance(class s
cala.Tuple2), [value#923L], [time#915-T10000ms, value#916L], obj#927: scala.Tuple2, st
ate info [ checkpoint = file:/tmp/checkpoint-FlatMapGroupsWithStateExec_demo/state, ru
nId = 95c3917c-2fd7-45b2-86f6-6c01f0115e1d, opId = 0, ver = 1, numPartitions = 1], cla
ss[value[0]: bigint], 2, Update, EventTimeTimeout, 1561881557499, 5000
+- *(1) Sort [value#923L ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(value#923L, 1)
+- AppendColumns $line140.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$i
w$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$
$iw$$iw$$iw$$iw$$Lambda$4118/2131767153@3e606b4c, newInstance(class scala.Tuple2), [in
put[0, bigint, false] AS value#923L]
+- EventTimeWatermark time#915: timestamp, interval 10 seconds
+- LocalTableScan <empty>, [time#915, value#916L]
*/

type Millis = Long


def toMillis(datetime: String): Millis = {
import java.time.format.DateTimeFormatter
import java.time.LocalDateTime
import java.time.ZoneOffset
LocalDateTime
.parse(datetime, DateTimeFormatter.ISO_DATE_TIME)
.toInstant(ZoneOffset.UTC)
.toEpochMilli
}

val currentWatermark = streamingQuery.lastProgress.eventTime.get("watermark")


val currentWatermarkSecs = toMillis(currentWatermark).millis.toSeconds.seconds

val expectedWatermarkSecs = 5.seconds


assert(currentWatermarkSecs == expectedWatermarkSecs, s"Current event-time watermark is $currentWatermarkSecs, but should be $expectedWatermarkSecs (maximum event time - delayThreshold ${delayThreshold.toMillis})")

// Let's access the FlatMapGroupsWithStateExec physical operator


import org.apache.spark.sql.execution.streaming.StreamingQueryWrapper
import org.apache.spark.sql.execution.streaming.StreamExecution
val engine: StreamExecution = streamingQuery


.asInstanceOf[StreamingQueryWrapper]
.streamingQuery

import org.apache.spark.sql.execution.streaming.IncrementalExecution
val lastMicroBatch: IncrementalExecution = engine.lastExecution

// Access executedPlan that is the optimized physical query plan ready for execution
// All streaming optimizations have been applied at this point
val plan = lastMicroBatch.executedPlan

// Find the FlatMapGroupsWithStateExec physical operator


import org.apache.spark.sql.execution.streaming.FlatMapGroupsWithStateExec
val flatMapOp = plan.collect { case op: FlatMapGroupsWithStateExec => op }.head

// Display metrics
import org.apache.spark.sql.execution.metric.SQLMetric
def formatMetrics(name: String, metric: SQLMetric) = {
val desc = metric.name.getOrElse("")
val value = metric.value
f"| $name%-30s | $desc%-69s | $value%-10s"
}
flatMapOp.metrics.map { case (name, metric) => formatMetrics(name, metric) }.foreach(println)
/**
| numTotalStateRows | number of total state rows
| 0
| stateMemory | memory used by state total (min, med, max)
| 390
| loadedMapCacheHitCount | count of cache hit on states cache in provider
| 1
| numOutputRows | number of output rows
| 0
| stateOnCurrentVersionSizeBytes | estimated size of state only on current version tot
al (min, med, max) | 102
| loadedMapCacheMissCount | count of cache miss on states cache in provider
| 0
| commitTimeMs | time to commit changes total (min, med, max)
| -2
| allRemovalsTimeMs | total time to remove rows total (min, med, max)
| -2
| numUpdatedStateRows | number of updated state rows
| 0
| allUpdatesTimeMs | total time to update rows total (min, med, max)
| -2
*/

val batch = Seq(


Event(secs = 1, value = 1), // under the watermark (5000 ms) so it's disregarded
Event(secs = 6, value = 3)) // above the watermark so it should be counted
events.addData(batch)
streamingQuery.processAllAvailable()

/**


>>> keyCounts(key = 3, state = <empty>)


>>> >>> currentProcessingTimeMs: 1561881643568
>>> >>> currentWatermarkMs: 5000
>>> >>> hasTimedOut: false
*/

spark.table(queryName).show(truncate = false)
/**
+-----+-----+
|value|count|
+-----+-----+
|1 |[1] |
|2 |[1] |
|3 |[1] |
+-----+-----+
*/

val batch = Seq(


Event(secs = 17, value = 3)) // advances the watermark
events.addData(batch)
streamingQuery.processAllAvailable()

/**
>>> keyCounts(key = 3, state = <empty>)
>>> >>> currentProcessingTimeMs: 1561881672887
>>> >>> currentWatermarkMs: 5000
>>> >>> hasTimedOut: false
*/

val currentWatermark = streamingQuery.lastProgress.eventTime.get("watermark")


val currentWatermarkSecs = toMillis(currentWatermark).millis.toSeconds.seconds

val expectedWatermarkSecs = 7.seconds


assert(currentWatermarkSecs == expectedWatermarkSecs, s"Current event-time watermark is $currentWatermarkSecs, but should be $expectedWatermarkSecs (maximum event time - delayThreshold ${delayThreshold.toMillis})")

spark.table(queryName).show(truncate = false)
/**
+-----+-----+
|value|count|
+-----+-----+
|1 |[1] |
|2 |[1] |
|3 |[1] |
|3 |[1] |
+-----+-----+
*/

val batch = Seq(


Event(secs = 18, value = 3)) // advances the watermark
events.addData(batch)
streamingQuery.processAllAvailable()


/**
>>> keyCounts(key = 3, state = <empty>)
>>> >>> currentProcessingTimeMs: 1561881778165
>>> >>> currentWatermarkMs: 7000
>>> >>> hasTimedOut: false
*/

// Eventually...
streamingQuery.stop()


Demo: Arbitrary Stateful Streaming Aggregation with
KeyValueGroupedDataset.flatMapGroupsWithState Operator
The following demo shows an example of Arbitrary Stateful Streaming Aggregation with
KeyValueGroupedDataset.flatMapGroupsWithState operator.

import java.sql.Timestamp
type DeviceId = Long
case class Signal(timestamp: Timestamp, deviceId: DeviceId, value: Long)

// input stream
import org.apache.spark.sql.functions._
val signals = spark
.readStream
.format("rate")
.option("rowsPerSecond", 1)
.load
.withColumn("deviceId", rint(rand() * 10) cast "int") // 10 devices randomly assigne
d to values
.withColumn("value", $"value" % 10) // randomize the values (just for fun)
.as[Signal] // convert to our type (from "unpleasant" Row)

import org.apache.spark.sql.streaming.GroupState
type Key = Int
type Count = Long
type State = Map[Key, Count]
case class EventsCounted(deviceId: DeviceId, count: Long)
def countValuesPerDevice(
deviceId: Int,
signals: Iterator[Signal],
state: GroupState[State]): Iterator[EventsCounted] = {
val values = signals.toSeq
println(s"Device: $deviceId")
println(s"Signals (${values.size}):")
values.zipWithIndex.foreach { case (v, idx) => println(s"$idx. $v") }
println(s"State: $state")

// update the state with the count of elements for the key
val initialState: State = Map(deviceId -> 0)
val oldState = state.getOption.getOrElse(initialState)
// the name to highlight that the state is for the key only
val newValue = oldState(deviceId) + values.size
val newState = Map(deviceId -> newValue)
state.update(newState)


// do not return the signals iterator as it has already been consumed


// that leads to a very subtle error where no elements are in an iterator
// iterators are one-pass data structures
Iterator(EventsCounted(deviceId, newValue))
}

// stream processing using flatMapGroupsWithState operator


val deviceId: Signal => DeviceId = { case Signal(_, deviceId, _) => deviceId }
val signalsByDevice = signals.groupByKey(deviceId)

import org.apache.spark.sql.streaming.{GroupStateTimeout, OutputMode}


val signalCounter = signalsByDevice.flatMapGroupsWithState(
outputMode = OutputMode.Append,
timeoutConf = GroupStateTimeout.NoTimeout)(countValuesPerDevice)

import org.apache.spark.sql.streaming.{OutputMode, Trigger}


import scala.concurrent.duration._
val sq = signalCounter.
writeStream.
format("console").
option("truncate", false).
trigger(Trigger.ProcessingTime(10.seconds)).
outputMode(OutputMode.Append).
start


Demo: Exploring Checkpointed State


The following demo shows the internals of the checkpointed state of a stateful streaming
query.

The demo uses the state checkpoint directory that was used in Demo: Streaming Watermark
with Aggregation in Append Output Mode.

// Change the path to match your configuration


val checkpointRootLocation = "/tmp/checkpoint-watermark_demo/state"
val version = 1L

import org.apache.spark.sql.execution.streaming.state.StateStoreId
val storeId = StateStoreId(
checkpointRootLocation,
operatorId = 0,
partitionId = 0)

// The key and value schemas should match the watermark demo
// .groupBy(window($"time", windowDuration.toString) as "sliding_window")
import org.apache.spark.sql.types.{TimestampType, StructField, StructType}
val keySchema = StructType(
StructField("sliding_window",
StructType(
StructField("start", TimestampType, nullable = true) ::
StructField("end", TimestampType, nullable = true) :: Nil),
nullable = false) :: Nil)
scala> keySchema.printTreeString
root
|-- sliding_window: struct (nullable = false)
| |-- start: timestamp (nullable = true)
| |-- end: timestamp (nullable = true)

// .agg(collect_list("batch") as "batches", collect_list("value") as "values")


import org.apache.spark.sql.types.{ArrayType, LongType}
val valueSchema = StructType(
StructField("batches", ArrayType(LongType, true), true) ::
StructField("values", ArrayType(LongType, true), true) :: Nil)
scala> valueSchema.printTreeString
root
|-- batches: array (nullable = true)
| |-- element: long (containsNull = true)
|-- values: array (nullable = true)
| |-- element: long (containsNull = true)

val indexOrdinal = None


import org.apache.spark.sql.execution.streaming.state.StateStoreConf
val storeConf = StateStoreConf(spark.sessionState.conf)
val hadoopConf = spark.sessionState.newHadoopConf()


import org.apache.spark.sql.execution.streaming.state.StateStoreProvider
val provider = StateStoreProvider.createAndInit(
storeId, keySchema, valueSchema, indexOrdinal, storeConf, hadoopConf)

// You may want to use the following higher-level code instead


import java.util.UUID
val queryRunId = UUID.randomUUID
import org.apache.spark.sql.execution.streaming.state.StateStoreProviderId
val storeProviderId = StateStoreProviderId(storeId, queryRunId)
import org.apache.spark.sql.execution.streaming.state.StateStore
val store = StateStore.get(
storeProviderId,
keySchema,
valueSchema,
indexOrdinal,
version,
storeConf,
hadoopConf)

import org.apache.spark.sql.execution.streaming.state.UnsafeRowPair
def formatRowPair(rowPair: UnsafeRowPair) = {
s"(${rowPair.key.getLong(0)}, ${rowPair.value.getLong(0)})"
}
store.iterator.map(formatRowPair).foreach(println)

// WIP: Missing value (per window)


def formatRowPair(rowPair: UnsafeRowPair) = {
val window = rowPair.key.getStruct(0, 2)
import scala.concurrent.duration._
val begin = window.getLong(0).millis.toSeconds
val end = window.getLong(1).millis.toSeconds

val value = rowPair.value.getStruct(0, 4)


// input is (time, value, batch) all longs
val t = value.getLong(1).millis.toSeconds
val v = value.getLong(2)
val b = value.getLong(3)
s"(key: [$begin, $end], ($t, $v, $b))"
}
store.iterator.map(formatRowPair).foreach(println)


Demo: Streaming Watermark with Aggregation in Append Output Mode
The following demo shows the behaviour and the internals of streaming watermark with a
streaming aggregation in Append output mode.

The demo also shows the behaviour and the internals of StateStoreSaveExec physical
operator in Append output mode.

Tip: The code below is part of the StreamingAggregationAppendMode streaming application.

// Reduce the number of partitions and hence the state stores


// That is supposed to make debugging state checkpointing easier
val numShufflePartitions = 1
import org.apache.spark.sql.internal.SQLConf.SHUFFLE_PARTITIONS
spark.sessionState.conf.setConf(SHUFFLE_PARTITIONS, numShufflePartitions)
assert(spark.sessionState.conf.numShufflePartitions == numShufflePartitions)

// Define event "format"


// Use :paste mode in spark-shell
import java.sql.Timestamp
case class Event(time: Timestamp, value: Long, batch: Long)
import scala.concurrent.duration._
object Event {
def apply(secs: Long, value: Long, batch: Long): Event = {
Event(new Timestamp(secs.seconds.toMillis), value, batch)
}
}

// Using memory data source for full control of the input


import org.apache.spark.sql.execution.streaming.MemoryStream
implicit val sqlCtx = spark.sqlContext
val events = MemoryStream[Event]
val values = events.toDS
assert(values.isStreaming, "values must be a streaming Dataset")

values.printSchema
/**
root
|-- time: timestamp (nullable = true)
|-- value: long (nullable = false)
|-- batch: long (nullable = false)
*/

// Streaming aggregation using groupBy operator to demo StateStoreSaveExec operator


// Define required watermark for late events for Append output mode


import scala.concurrent.duration._
val delayThreshold = 10.seconds
val eventTime = "time"

val valuesWatermarked = values
  .withWatermark(eventTime, delayThreshold.toString) // defines watermark (before groupBy!)

// EventTimeWatermark logical operator is planned as EventTimeWatermarkExec physical operator
// Note that as a physical operator EventTimeWatermarkExec shows itself without the Exec suffix
valuesWatermarked.explain
/**
== Physical Plan ==
EventTimeWatermark time#3: timestamp, interval 10 seconds
+- StreamingRelation MemoryStream[time#3,value#4L,batch#5L], [time#3, value#4L, batch#
5L]
*/

val windowDuration = 5.seconds


import org.apache.spark.sql.functions.window
val countsPer5secWindow = valuesWatermarked
.groupBy(window(col(eventTime), windowDuration.toString) as "sliding_window")
.agg(collect_list("batch") as "batches", collect_list("value") as "values")

countsPer5secWindow.printSchema
/**
root
|-- sliding_window: struct (nullable = false)
| |-- start: timestamp (nullable = true)
| |-- end: timestamp (nullable = true)
|-- batches: array (nullable = true)
| |-- element: long (containsNull = true)
|-- values: array (nullable = true)
| |-- element: long (containsNull = true)
*/

// valuesPerGroupWindowed is a streaming Dataset with just one source


// It knows nothing about output mode or watermark yet
// That's why StatefulOperatorStateInfo is generic
// and no batch-specific values are printed out
// That will be available after the first streaming batch
// Use sq.explain to know the runtime-specific values
countsPer5secWindow.explain
/**
== Physical Plan ==
ObjectHashAggregate(keys=[window#23-T10000ms], functions=[collect_list(batch#5L, 0, 0)
, collect_list(value#4L, 0, 0)])
+- StateStoreSave [window#23-T10000ms], state info [ checkpoint = <unknown>, runId = 5
0e62943-fe5d-4a02-8498-7134ecbf5122, opId = 0, ver = 0, numPartitions = 1], Append, 0,
2
+- ObjectHashAggregate(keys=[window#23-T10000ms], functions=[merge_collect_list(bat


ch#5L, 0, 0), merge_collect_list(value#4L, 0, 0)])


+- StateStoreRestore [window#23-T10000ms], state info [ checkpoint = <unknown>,
runId = 50e62943-fe5d-4a02-8498-7134ecbf5122, opId = 0, ver = 0, numPartitions = 1], 2
+- ObjectHashAggregate(keys=[window#23-T10000ms], functions=[merge_collect_li
st(batch#5L, 0, 0), merge_collect_list(value#4L, 0, 0)])
+- Exchange hashpartitioning(window#23-T10000ms, 1)
+- ObjectHashAggregate(keys=[window#23-T10000ms], functions=[partial_co
llect_list(batch#5L, 0, 0), partial_collect_list(value#4L, 0, 0)])
+- *(1) Project [named_struct(start, precisetimestampconversion(((((
CASE WHEN (cast(CEIL((cast((precisetimestampconversion(time#3-T10000ms, TimestampType,
LongType) - 0) as double) / 5000000.0)) as double) = (cast((precisetimestampconversio
n(time#3-T10000ms, TimestampType, LongType) - 0) as double) / 5000000.0)) THEN (CEIL((
cast((precisetimestampconversion(time#3-T10000ms, TimestampType, LongType) - 0) as dou
ble) / 5000000.0)) + 1) ELSE CEIL((cast((precisetimestampconversion(time#3-T10000ms, T
imestampType, LongType) - 0) as double) / 5000000.0)) END + 0) - 1) * 5000000) + 0), L
ongType, TimestampType), end, precisetimestampconversion(((((CASE WHEN (cast(CEIL((cas
t((precisetimestampconversion(time#3-T10000ms, TimestampType, LongType) - 0) as double
) / 5000000.0)) as double) = (cast((precisetimestampconversion(time#3-T10000ms, Timest
ampType, LongType) - 0) as double) / 5000000.0)) THEN (CEIL((cast((precisetimestampcon
version(time#3-T10000ms, TimestampType, LongType) - 0) as double) / 5000000.0)) + 1) E
LSE CEIL((cast((precisetimestampconversion(time#3-T10000ms, TimestampType, LongType) -
0) as double) / 5000000.0)) END + 0) - 1) * 5000000) + 5000000), LongType, TimestampT
ype)) AS window#23-T10000ms, value#4L, batch#5L]
+- *(1) Filter isnotnull(time#3-T10000ms)
+- EventTimeWatermark time#3: timestamp, interval 10 seconds
+- StreamingRelation MemoryStream[time#3,value#4L,batch#5L]
, [time#3, value#4L, batch#5L]
*/

val queryName = "watermark_demo"


val checkpointLocation = s"/tmp/checkpoint-$queryName"

// Delete the checkpoint location from previous executions


import java.nio.file.{Files, FileSystems}
import java.util.Comparator
import scala.collection.JavaConverters._
val path = FileSystems.getDefault.getPath(checkpointLocation)
if (Files.exists(path)) {
Files.walk(path)
.sorted(Comparator.reverseOrder())
.iterator
.asScala
.foreach(p => p.toFile.delete)
}

// FIXME Use foreachBatch for batchId and the output Dataset


// Start the query and hence StateStoreSaveExec
import scala.concurrent.duration._
import org.apache.spark.sql.streaming.OutputMode
val streamingQuery = countsPer5secWindow
.writeStream
.format("memory")
.queryName(queryName)


.option("checkpointLocation", checkpointLocation)
.outputMode(OutputMode.Append) // <-- Use Append output mode
.start

assert(streamingQuery.status.message == "Waiting for data to arrive")

type Millis = Long


def toMillis(datetime: String): Millis = {
import java.time.format.DateTimeFormatter
import java.time.LocalDateTime
import java.time.ZoneOffset
LocalDateTime
.parse(datetime, DateTimeFormatter.ISO_DATE_TIME)
.toInstant(ZoneOffset.UTC)
.toEpochMilli
}

// Use web UI to monitor the state of state (no pun intended)


// StateStoreSave and StateStoreRestore operators all have state metrics
// Go to http://localhost:4040/SQL/ and click one of the Completed Queries with Job IDs

// You may also want to check out checkpointed state


// in /tmp/checkpoint-watermark_demo/state/0/0

// The demo is aimed to show the following:


// 1. The current watermark
// 2. Check out the stats:
// - expired state (below the current watermark, goes to output and purged later)
// - late state (dropped as if never received and processed)
// - saved state rows (above the current watermark)

val batch = Seq(


Event(1, 1, batch = 1),
Event(15, 2, batch = 1))
events.addData(batch)
streamingQuery.processAllAvailable()

println(streamingQuery.lastProgress.stateOperators(0).prettyJson)
/**
{
"numRowsTotal" : 1,
"numRowsUpdated" : 0,
"memoryUsedBytes" : 1102,
"customMetrics" : {
"loadedMapCacheHitCount" : 2,
"loadedMapCacheMissCount" : 0,
"stateOnCurrentVersionSizeBytes" : 414
}
}
*/

val currentWatermark = streamingQuery.lastProgress.eventTime.get("watermark")
val currentWatermarkMs = toMillis(currentWatermark)

val maxTime = batch.maxBy(_.time.toInstant.toEpochMilli).time.toInstant.toEpochMilli.millis.toSeconds
val expectedMaxTime = 15
assert(maxTime == expectedMaxTime, s"Maximum time across events per batch is $maxTime, but should be $expectedMaxTime")

val expectedWatermarkMs = 5.seconds.toMillis
assert(currentWatermarkMs == expectedWatermarkMs, s"Current event-time watermark is $currentWatermarkMs, but should be $expectedWatermarkMs (maximum event time ${maxTime.seconds.toMillis} minus delayThreshold ${delayThreshold.toMillis})")

// FIXME Saved State Rows


// Use the metrics of the StateStoreSave operator
// Or simply streamingQuery.lastProgress.stateOperators.head
spark.table(queryName).orderBy("sliding_window").show(truncate = false)
/**
+------------------------------------------+-------+------+
|sliding_window |batches|values|
+------------------------------------------+-------+------+
|[1970-01-01 01:00:00, 1970-01-01 01:00:05]|[1] |[1] |
+------------------------------------------+-------+------+
*/

// With at least one execution we can review the execution plan


streamingQuery.explain
/**
scala> streamingQuery.explain
== Physical Plan ==
ObjectHashAggregate(keys=[window#18-T10000ms], functions=[collect_list(batch#5L, 0, 0)
, collect_list(value#4L, 0, 0)])
+- StateStoreSave [window#18-T10000ms], state info [ checkpoint = file:/tmp/checkpoint
-watermark_demo/state, runId = 73bb0ede-20f2-400d-8003-aa2fbebdd2e1, opId = 0, ver = 1
, numPartitions = 1], Append, 5000, 2
+- ObjectHashAggregate(keys=[window#18-T10000ms], functions=[merge_collect_list(bat
ch#5L, 0, 0), merge_collect_list(value#4L, 0, 0)])
+- StateStoreRestore [window#18-T10000ms], state info [ checkpoint = file:/tmp/c
heckpoint-watermark_demo/state, runId = 73bb0ede-20f2-400d-8003-aa2fbebdd2e1, opId = 0
, ver = 1, numPartitions = 1], 2
+- ObjectHashAggregate(keys=[window#18-T10000ms], functions=[merge_collect_li
st(batch#5L, 0, 0), merge_collect_list(value#4L, 0, 0)])
+- Exchange hashpartitioning(window#18-T10000ms, 1)
+- ObjectHashAggregate(keys=[window#18-T10000ms], functions=[partial_co
llect_list(batch#5L, 0, 0), partial_collect_list(value#4L, 0, 0)])
+- *(1) Project [named_struct(start, precisetimestampconversion(((((
CASE WHEN (cast(CEIL((cast((precisetimestampconversion(time#3-T10000ms, TimestampType,
LongType) - 0) as double) / 5000000.0)) as double) = (cast((precisetimestampconversio
n(time#3-T10000ms, TimestampType, LongType) - 0) as double) / 5000000.0)) THEN (CEIL((
cast((precisetimestampconversion(time#3-T10000ms, TimestampType, LongType) - 0) as dou
ble) / 5000000.0)) + 1) ELSE CEIL((cast((precisetimestampconversion(time#3-T10000ms, T
imestampType, LongType) - 0) as double) / 5000000.0)) END + 0) - 1) * 5000000) + 0), L
ongType, TimestampType), end, precisetimestampconversion(((((CASE WHEN (cast(CEIL((cas


t((precisetimestampconversion(time#3-T10000ms, TimestampType, LongType) - 0) as double


) / 5000000.0)) as double) = (cast((precisetimestampconversion(time#3-T10000ms, Timest
ampType, LongType) - 0) as double) / 5000000.0)) THEN (CEIL((cast((precisetimestampcon
version(time#3-T10000ms, TimestampType, LongType) - 0) as double) / 5000000.0)) + 1) E
LSE CEIL((cast((precisetimestampconversion(time#3-T10000ms, TimestampType, LongType) -
0) as double) / 5000000.0)) END + 0) - 1) * 5000000) + 5000000), LongType, TimestampT
ype)) AS window#18-T10000ms, value#4L, batch#5L]
+- *(1) Filter isnotnull(time#3-T10000ms)
+- EventTimeWatermark time#3: timestamp, interval 10 seconds
+- LocalTableScan <empty>, [time#3, value#4L, batch#5L]
*/

import org.apache.spark.sql.execution.streaming.StreamingQueryWrapper
val engine = streamingQuery
.asInstanceOf[StreamingQueryWrapper]
.streamingQuery
import org.apache.spark.sql.execution.streaming.StreamExecution
assert(engine.isInstanceOf[StreamExecution])

val lastMicroBatch = engine.lastExecution


import org.apache.spark.sql.execution.streaming.IncrementalExecution
assert(lastMicroBatch.isInstanceOf[IncrementalExecution])

// Access executedPlan that is the optimized physical query plan ready for execution
// All streaming optimizations have been applied at this point
// We just need the EventTimeWatermarkExec physical operator
val plan = lastMicroBatch.executedPlan

// Let's find the EventTimeWatermarkExec physical operator in the plan


// There should be one only
import org.apache.spark.sql.execution.streaming.EventTimeWatermarkExec
val watermarkOp = plan.collect { case op: EventTimeWatermarkExec => op }.head

// Let's check out the event-time watermark stats


// They correspond to the concrete EventTimeWatermarkExec operator for a micro-batch
val stats = watermarkOp.eventTimeStats.value
import org.apache.spark.sql.execution.streaming.EventTimeStats
assert(stats.isInstanceOf[EventTimeStats])

println(stats)
/**
EventTimeStats(-9223372036854775808,9223372036854775807,0.0,0)
*/

val batch = Seq(


Event(1, 1, batch = 2),
Event(15, 2, batch = 2),
Event(35, 3, batch = 2))
events.addData(batch)
streamingQuery.processAllAvailable()

val currentWatermark = streamingQuery.lastProgress.eventTime.get("watermark")
val currentWatermarkMs = toMillis(currentWatermark)

val maxTime = batch.maxBy(_.time.toInstant.toEpochMilli).time.toInstant.toEpochMilli.millis.toSeconds
val expectedMaxTime = 35
assert(maxTime == expectedMaxTime, s"Maximum time across events per batch is $maxTime, but should be $expectedMaxTime")

val expectedWatermarkMs = 25.seconds.toMillis
assert(currentWatermarkMs == expectedWatermarkMs, s"Current event-time watermark is $currentWatermarkMs, but should be $expectedWatermarkMs (maximum event time ${maxTime.seconds.toMillis} minus delayThreshold ${delayThreshold.toMillis})")

// FIXME Expired State


// FIXME Late Events
// FIXME Saved State Rows
spark.table(queryName).orderBy("sliding_window").show(truncate = false)
/**
+------------------------------------------+-------+------+
|sliding_window |batches|values|
+------------------------------------------+-------+------+
|[1970-01-01 01:00:00, 1970-01-01 01:00:05]|[1] |[1] |
|[1970-01-01 01:00:15, 1970-01-01 01:00:20]|[1, 2] |[2, 2]|
+------------------------------------------+-------+------+
*/

// Check out the event-time watermark stats


val plan = engine.lastExecution.executedPlan
import org.apache.spark.sql.execution.streaming.EventTimeWatermarkExec
val watermarkOp = plan.collect { case op: EventTimeWatermarkExec => op }.head
val stats = watermarkOp.eventTimeStats.value
import org.apache.spark.sql.execution.streaming.EventTimeStats
assert(stats.isInstanceOf[EventTimeStats])

println(stats)
/**
EventTimeStats(-9223372036854775808,9223372036854775807,0.0,0)
*/

val batch = Seq(


Event(15,1, batch = 3),
Event(15,2, batch = 3),
Event(20,3, batch = 3),
Event(26,4, batch = 3))
events.addData(batch)
streamingQuery.processAllAvailable()

val currentWatermark = streamingQuery.lastProgress.eventTime.get("watermark")
val currentWatermarkMs = toMillis(currentWatermark)

val maxTime = batch.maxBy(_.time.toInstant.toEpochMilli).time.toInstant.toEpochMilli.millis.toSeconds
val expectedMaxTime = 26
assert(maxTime == expectedMaxTime, s"Maximum time across events per batch is $maxTime, but should be $expectedMaxTime")

// Current event-time watermark should be the same as previously
// val expectedWatermarkMs = 25.seconds.toMillis
// The current max time is merely 26 so subtracting delayThreshold gives merely 16
assert(currentWatermarkMs == expectedWatermarkMs, s"Current event-time watermark is $currentWatermarkMs, but should be $expectedWatermarkMs (maximum event time ${maxTime.seconds.toMillis} minus delayThreshold ${delayThreshold.toMillis})")

// FIXME Expired State


// FIXME Late Events
// FIXME Saved State Rows
spark.table(queryName).orderBy("sliding_window").show(truncate = false)
/**
+------------------------------------------+-------+------+
|sliding_window |batches|values|
+------------------------------------------+-------+------+
|[1970-01-01 01:00:00, 1970-01-01 01:00:05]|[1] |[1] |
|[1970-01-01 01:00:15, 1970-01-01 01:00:20]|[1, 2] |[2, 2]|
+------------------------------------------+-------+------+
*/

// Check out the event-time watermark stats


val plan = engine.lastExecution.executedPlan
import org.apache.spark.sql.execution.streaming.EventTimeWatermarkExec
val watermarkOp = plan.collect { case op: EventTimeWatermarkExec => op }.head
val stats = watermarkOp.eventTimeStats.value
import org.apache.spark.sql.execution.streaming.EventTimeStats
assert(stats.isInstanceOf[EventTimeStats])

println(stats)
/**
EventTimeStats(26000,15000,19000.0,4)
*/

val batch = Seq(


Event(36, 1, batch = 4))
events.addData(batch)
streamingQuery.processAllAvailable()

val currentWatermark = streamingQuery.lastProgress.eventTime.get("watermark")
val currentWatermarkMs = toMillis(currentWatermark)

val maxTime = batch.maxBy(_.time.toInstant.toEpochMilli).time.toInstant.toEpochMilli.millis.toSeconds
val expectedMaxTime = 36
assert(maxTime == expectedMaxTime, s"Maximum time across events per batch is $maxTime, but should be $expectedMaxTime")

val expectedWatermarkMs = 26.seconds.toMillis
assert(currentWatermarkMs == expectedWatermarkMs, s"Current event-time watermark is $currentWatermarkMs, but should be $expectedWatermarkMs (maximum event time ${maxTime.seconds.toMillis} minus delayThreshold ${delayThreshold.toMillis})")


// FIXME Expired State


// FIXME Late Events
// FIXME Saved State Rows
spark.table(queryName).orderBy("sliding_window").show(truncate = false)
/**
+------------------------------------------+-------+------+
|sliding_window |batches|values|
+------------------------------------------+-------+------+
|[1970-01-01 01:00:00, 1970-01-01 01:00:05]|[1] |[1] |
|[1970-01-01 01:00:15, 1970-01-01 01:00:20]|[1, 2] |[2, 2]|
+------------------------------------------+-------+------+
*/

// Check out the event-time watermark stats


val plan = engine.lastExecution.executedPlan
import org.apache.spark.sql.execution.streaming.EventTimeWatermarkExec
val watermarkOp = plan.collect { case op: EventTimeWatermarkExec => op }.head
val stats = watermarkOp.eventTimeStats.value
import org.apache.spark.sql.execution.streaming.EventTimeStats
assert(stats.isInstanceOf[EventTimeStats])

println(stats)
/**
EventTimeStats(-9223372036854775808,9223372036854775807,0.0,0)
*/

val batch = Seq(


Event(50, 1, batch = 5)
)
events.addData(batch)
streamingQuery.processAllAvailable()

val currentWatermark = streamingQuery.lastProgress.eventTime.get("watermark")
val currentWatermarkMs = toMillis(currentWatermark)

val maxTime = batch.maxBy(_.time.toInstant.toEpochMilli).time.toInstant.toEpochMilli.millis.toSeconds
val expectedMaxTime = 50
assert(maxTime == expectedMaxTime, s"Maximum time across events per batch is $maxTime, but should be $expectedMaxTime")

val expectedWatermarkMs = 40.seconds.toMillis
assert(currentWatermarkMs == expectedWatermarkMs, s"Current event-time watermark is $currentWatermarkMs, but should be $expectedWatermarkMs (maximum event time ${maxTime.seconds.toMillis} minus delayThreshold ${delayThreshold.toMillis})")

// FIXME Expired State


// FIXME Late Events
// FIXME Saved State Rows
spark.table(queryName).orderBy("sliding_window").show(truncate = false)
/**
+------------------------------------------+-------+------+


|sliding_window |batches|values|
+------------------------------------------+-------+------+
|[1970-01-01 01:00:00, 1970-01-01 01:00:05]|[1] |[1] |
|[1970-01-01 01:00:15, 1970-01-01 01:00:20]|[1, 2] |[2, 2]|
|[1970-01-01 01:00:25, 1970-01-01 01:00:30]|[3] |[4] |
|[1970-01-01 01:00:35, 1970-01-01 01:00:40]|[2, 4] |[3, 1]|
+------------------------------------------+-------+------+
*/

// Check out the event-time watermark stats


val plan = engine.lastExecution.executedPlan
import org.apache.spark.sql.execution.streaming.EventTimeWatermarkExec
val watermarkOp = plan.collect { case op: EventTimeWatermarkExec => op }.head
val stats = watermarkOp.eventTimeStats.value
import org.apache.spark.sql.execution.streaming.EventTimeStats
assert(stats.isInstanceOf[EventTimeStats])

println(stats)
/**
EventTimeStats(-9223372036854775808,9223372036854775807,0.0,0)
*/

// Eventually...
streamingQuery.stop()


Demo: Streaming Query for Running Counts (Socket Source and Complete Output Mode)
The following code shows a streaming aggregation (with Dataset.groupBy operator) in
complete output mode that reads text lines from a socket (using socket data source) and
outputs running counts of the words.

The example is "borrowed" from the official documentation of Spark. Changes


Note
and errors are only mine.

Important Run nc -lk 9999 first before running the demo.

// START: Only for easier debugging


// Reduce the number of partitions
// The state is then only for one partition
// which should make monitoring easier
val numShufflePartitions = 1
import org.apache.spark.sql.internal.SQLConf.SHUFFLE_PARTITIONS
spark.sessionState.conf.setConf(SHUFFLE_PARTITIONS, numShufflePartitions)

assert(spark.sessionState.conf.numShufflePartitions == numShufflePartitions)
// END: Only for easier debugging

val lines = spark


.readStream
.format("socket")
.option("host", "localhost")
.option("port", 9999)
.load

scala> lines.printSchema
root
|-- value: string (nullable = true)

import org.apache.spark.sql.functions.explode
val words = lines
.select(explode(split($"value", """\W+""")) as "word")

val counts = words.groupBy("word").count

scala> counts.printSchema
root
|-- word: string (nullable = true)
|-- count: long (nullable = false)

// nc -lk 9999 is supposed to be up at this point


val queryName = "running_counts"


val checkpointLocation = s"/tmp/checkpoint-$queryName"

// Delete the checkpoint location from previous executions


import java.nio.file.{Files, FileSystems}
import java.util.Comparator
import scala.collection.JavaConverters._
val path = FileSystems.getDefault.getPath(checkpointLocation)
if (Files.exists(path)) {
Files.walk(path)
.sorted(Comparator.reverseOrder())
.iterator
.asScala
.foreach(p => p.toFile.delete)
}

import org.apache.spark.sql.streaming.OutputMode.Complete
val runningCounts = counts
.writeStream
.format("console")
.option("checkpointLocation", checkpointLocation)
.outputMode(Complete)
.start

scala> runningCounts.explain
== Physical Plan ==
WriteToDataSourceV2 org.apache.spark.sql.execution.streaming.sources.MicroBatchWriter@
205f195c
+- *(5) HashAggregate(keys=[word#72], functions=[count(1)])
+- StateStoreSave [word#72], state info [ checkpoint = file:/tmp/checkpoint-running
_counts/state, runId = f3b2e642-1790-4a17-ab61-3d894110b063, opId = 0, ver = 0, numPar
titions = 1], Complete, 0, 2
+- *(4) HashAggregate(keys=[word#72], functions=[merge_count(1)])
+- StateStoreRestore [word#72], state info [ checkpoint = file:/tmp/checkpoin
t-running_counts/state, runId = f3b2e642-1790-4a17-ab61-3d894110b063, opId = 0, ver = 0
, numPartitions = 1], 2
+- *(3) HashAggregate(keys=[word#72], functions=[merge_count(1)])
+- Exchange hashpartitioning(word#72, 1)
+- *(2) HashAggregate(keys=[word#72], functions=[partial_count(1)])
+- Generate explode(split(value#83, \W+)), false, [word#72]
+- *(1) Project [value#83]
+- *(1) ScanV2 socket[value#83] (Options: [host=localhost,p
ort=9999])

// Type lines (words) in the terminal with nc


// Observe the counts in spark-shell

// Use web UI to monitor the state of state (no pun intended)


// StateStoreSave and StateStoreRestore operators all have state metrics
// Go to http://localhost:4040/SQL/ and click one of the Completed Queries with Job IDs

// You may also want to check out checkpointed state


// in /tmp/checkpoint-running_counts/state/0/0

// Eventually...
runningCounts.stop()


Demo: Streaming Aggregation with Kafka Data Source

The following example code shows a streaming aggregation (with Dataset.groupBy operator) that reads records from Kafka (with Kafka Data Source).

Important: Start up Kafka cluster and spark-shell with spark-sql-kafka-0-10 package before running the demo.

Tip: You may want to consider copying the following code to append.txt and using :load append.txt command in spark-shell to load it (rather than copying and pasting it).

// START: Only for easier debugging


// The state is then only for one partition
// which should make monitoring easier
val numShufflePartitions = 1
import org.apache.spark.sql.internal.SQLConf.SHUFFLE_PARTITIONS
spark.sessionState.conf.setConf(SHUFFLE_PARTITIONS, numShufflePartitions)

assert(spark.sessionState.conf.numShufflePartitions == numShufflePartitions)
// END: Only for easier debugging

val records = spark
.readStream
.format("kafka")
.option("subscribePattern", """topic-\d{2}""") // topics with two digits at the end
.option("kafka.bootstrap.servers", ":9092")
.load
scala> records.printSchema
root
|-- key: binary (nullable = true)
|-- value: binary (nullable = true)
|-- topic: string (nullable = true)
|-- partition: integer (nullable = true)
|-- offset: long (nullable = true)
|-- timestamp: timestamp (nullable = true)
|-- timestampType: integer (nullable = true)

// Since the streaming query uses Append output mode
// it has to define a streaming event-time watermark (using Dataset.withWatermark operator)
// UnsupportedOperationChecker makes sure that the requirement holds
val ids = records
  .withColumn("tokens", split($"value", ","))
  .withColumn("seconds", 'tokens(0) cast "long")
  .withColumn("event_time", to_timestamp(from_unixtime('seconds))) // <-- Event time has to be a timestamp
  .withColumn("id", 'tokens(1))
  .withColumn("batch", 'tokens(2) cast "int")
  .withWatermark(eventTime = "event_time", delayThreshold = "10 seconds") // <-- define watermark (before groupBy!)
  .groupBy($"event_time") // <-- use event_time for grouping
  .agg(collect_list("batch") as "batches", collect_list("id") as "ids")
  .withColumn("event_time", to_timestamp($"event_time")) // <-- convert to human-readable date
scala> ids.printSchema
root
|-- event_time: timestamp (nullable = true)
|-- batches: array (nullable = true)
| |-- element: integer (containsNull = true)
|-- ids: array (nullable = true)
| |-- element: string (containsNull = true)

assert(ids.isStreaming, "ids is a streaming query")

// ids knows nothing about the output mode or the current streaming watermark yet
// - Output mode is defined on writing side
// - streaming watermark is read from rows at runtime
// That's why StatefulOperatorStateInfo is generic (and uses the default Append for ou
tput mode)
// and no batch-specific values are printed out
// They will be available right after the first streaming batch
// Use explain on a streaming query to know the trigger-specific values
scala> ids.explain
== Physical Plan ==
ObjectHashAggregate(keys=[event_time#118-T10000ms], functions=[collect_list(batch#141,
0, 0), collect_list(id#129, 0, 0)])
+- StateStoreSave [event_time#118-T10000ms], state info [ checkpoint = <unknown>, runI
d = a870e6e2-b925-4104-9886-b211c0be1b73, opId = 0, ver = 0, numPartitions = 1], Append
, 0, 2
+- ObjectHashAggregate(keys=[event_time#118-T10000ms], functions=[merge_collect_lis
t(batch#141, 0, 0), merge_collect_list(id#129, 0, 0)])
+- StateStoreRestore [event_time#118-T10000ms], state info [ checkpoint = <unkno
wn>, runId = a870e6e2-b925-4104-9886-b211c0be1b73, opId = 0, ver = 0, numPartitions = 1
], 2
+- ObjectHashAggregate(keys=[event_time#118-T10000ms], functions=[merge_colle
ct_list(batch#141, 0, 0), merge_collect_list(id#129, 0, 0)])
+- Exchange hashpartitioning(event_time#118-T10000ms, 1)
+- ObjectHashAggregate(keys=[event_time#118-T10000ms], functions=[parti
al_collect_list(batch#141, 0, 0), partial_collect_list(id#129, 0, 0)])
+- EventTimeWatermark event_time#118: timestamp, interval 10 seconds
+- *(1) Project [cast(from_unixtime(cast(split(cast(value#8 as st
ring), ,)[0] as bigint), yyyy-MM-dd HH:mm:ss, Some(Europe/Warsaw)) as timestamp) AS ev
ent_time#118, split(cast(value#8 as string), ,)[1] AS id#129, cast(split(cast(value#8
as string), ,)[2] as int) AS batch#141]
+- StreamingRelation kafka, [key#7, value#8, topic#9, partitio
n#10, offset#11L, timestamp#12, timestampType#13]

val queryName = "ids-kafka"


val checkpointLocation = s"/tmp/checkpoint-$queryName"

// Delete the checkpoint location from previous executions


import java.nio.file.{Files, FileSystems}
import java.util.Comparator
import scala.collection.JavaConverters._
val path = FileSystems.getDefault.getPath(checkpointLocation)
if (Files.exists(path)) {
Files.walk(path)
.sorted(Comparator.reverseOrder())
.iterator
.asScala
.foreach(p => p.toFile.delete)
}

// The following make for an easier demo


// Kafka cluster is supposed to be up at this point
// Make sure that a Kafka topic is available, e.g. topic-00
// Use ./bin/kafka-console-producer.sh --broker-list :9092 --topic topic-00
// And send a record, e.g. 1,1,1

// Define the output mode


// and start the query
import scala.concurrent.duration._
import org.apache.spark.sql.streaming.OutputMode.Append
import org.apache.spark.sql.streaming.Trigger
val streamingQuery = ids
.writeStream
.format("console")
.option("truncate", false)
.option("checkpointLocation", checkpointLocation)
.queryName(queryName)
.outputMode(Append)
.start

val lastProgress = streamingQuery.lastProgress


scala> :type lastProgress
org.apache.spark.sql.streaming.StreamingQueryProgress

assert(lastProgress.stateOperators.length == 1, "There should be one stateful operator")

scala> println(lastProgress.stateOperators.head.prettyJson)
{
"numRowsTotal" : 1,
"numRowsUpdated" : 0,
"memoryUsedBytes" : 742,
"customMetrics" : {
"loadedMapCacheHitCount" : 1,
"loadedMapCacheMissCount" : 1,
"stateOnCurrentVersionSizeBytes" : 374
}
}


assert(lastProgress.sources.length == 1, "There should be one streaming source only")


scala> println(lastProgress.sources.head.prettyJson)
{
"description" : "KafkaV2[SubscribePattern[topic-\\d{2}]]",
"startOffset" : {
"topic-00" : {
"0" : 1
}
},
"endOffset" : {
"topic-00" : {
"0" : 1
}
},
"numInputRows" : 0,
"inputRowsPerSecond" : 0.0,
"processedRowsPerSecond" : 0.0
}

// Eventually...
streamingQuery.stop()


Demo: groupByKey Streaming Aggregation in Update Mode

The example shows the Dataset.groupByKey streaming operator to count rows in Update output mode.

In other words, it is an example of using Dataset.groupByKey with the count aggregation function to count customer orders ( T ) per zip code ( K ).

Complete Spark Structured Streaming Application


package pl.japila.spark.examples

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.{OutputMode, Trigger}

object GroupByKeyStreamingApp extends App {

val inputTopic = "GroupByKeyApp-input"


val appName = this.getClass.getSimpleName.replace("$", "")

val spark = SparkSession.builder


.master("local[*]")
.appName(appName)
.getOrCreate
import spark.implicits._

case class Order(id: Long, zipCode: String)

// Input (source node)


val orders = spark
.readStream
.format("kafka")
.option("startingOffsets", "latest")
.option("subscribe", inputTopic)
.option("kafka.bootstrap.servers", ":9092")
.load
.select($"offset" as "id", $"value" as "zipCode") // FIXME Use csv, json, avro
.as[Order]

// Processing logic
// groupByKey + count
val byZipCode = (o: Order) => o.zipCode
val ordersByZipCode = orders.groupByKey(byZipCode)

import org.apache.spark.sql.functions.count
val typedCountCol = (count("zipCode") as "count").as[String]
val counts = ordersByZipCode
.agg(typedCountCol)
.select($"value" as "zip_code", $"count")

// Output (sink node)


import scala.concurrent.duration._
counts
.writeStream
.format("console")
.outputMode(OutputMode.Update) // FIXME Use Complete
.queryName(appName)
.trigger(Trigger.ProcessingTime(5.seconds))
.start
.awaitTermination()
}


Credits
The example with customer orders and postal codes is borrowed from Apache Beam’s
Using GroupByKey Programming Guide.


Demo: StateStoreSaveExec with Complete Output Mode

The following example code shows the behaviour of StateStoreSaveExec in Complete output mode.

// START: Only for easier debugging


// The state is then only for one partition
// which should make monitoring it easier
import org.apache.spark.sql.internal.SQLConf.SHUFFLE_PARTITIONS
spark.sessionState.conf.setConf(SHUFFLE_PARTITIONS, 1)
scala> spark.sessionState.conf.numShufflePartitions
res1: Int = 1
// END: Only for easier debugging

// Read datasets from a Kafka topic
// ./bin/spark-shell --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.0-SNAPSHOT
// Streaming aggregation using groupBy operator is required to have StateStoreSaveExec operator
val valuesPerGroup = spark.
readStream.
format("kafka").
option("subscribe", "topic1").
option("kafka.bootstrap.servers", "localhost:9092").
load.
withColumn("tokens", split('value, ",")).
withColumn("group", 'tokens(0)).
withColumn("value", 'tokens(1) cast "int").
select("group", "value").
groupBy($"group").
agg(collect_list("value") as "values").
orderBy($"group".asc)

// valuesPerGroup is a streaming Dataset with just one source


// so it knows nothing about output mode or watermark yet
// That's why StatefulOperatorStateInfo is generic
// and no batch-specific values are printed out
// That will be available after the first streaming batch
// Use sq.explain to know the runtime-specific values
scala> valuesPerGroup.explain
== Physical Plan ==
*Sort [group#25 ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(group#25 ASC NULLS FIRST, 1)
+- ObjectHashAggregate(keys=[group#25], functions=[collect_list(value#36, 0, 0)])
+- Exchange hashpartitioning(group#25, 1)
+- StateStoreSave [group#25], StatefulOperatorStateInfo(<unknown>,899f0fd1-b2
02-45cd-9ebd-09101ca90fa8,0,0), Append, 0

+- ObjectHashAggregate(keys=[group#25], functions=[merge_collect_list(valu
e#36, 0, 0)])
+- Exchange hashpartitioning(group#25, 1)
+- StateStoreRestore [group#25], StatefulOperatorStateInfo(<unknown>,
899f0fd1-b202-45cd-9ebd-09101ca90fa8,0,0)
+- ObjectHashAggregate(keys=[group#25], functions=[merge_collect_
list(value#36, 0, 0)])
+- Exchange hashpartitioning(group#25, 1)
+- ObjectHashAggregate(keys=[group#25], functions=[partial_
collect_list(value#36, 0, 0)])
+- *Project [split(cast(value#1 as string), ,)[0] AS gro
up#25, cast(split(cast(value#1 as string), ,)[1] as int) AS value#36]
+- StreamingRelation kafka, [key#0, value#1, topic#2,
partition#3, offset#4L, timestamp#5, timestampType#6]

// Start the query and hence StateStoreSaveExec


// Use Complete output mode
import scala.concurrent.duration._
import org.apache.spark.sql.streaming.{OutputMode, Trigger}
val sq = valuesPerGroup.
writeStream.
format("console").
option("truncate", false).
trigger(Trigger.ProcessingTime(10.seconds)).
outputMode(OutputMode.Complete).
start

-------------------------------------------
Batch: 0
-------------------------------------------
+-----+------+
|group|values|
+-----+------+
+-----+------+

// there's only 1 stateful operator and hence 0 for the index in stateOperators
scala> println(sq.lastProgress.stateOperators(0).prettyJson)
{
"numRowsTotal" : 0,
"numRowsUpdated" : 0,
"memoryUsedBytes" : 60
}

// publish 1 new key-value pair in a single streaming batch


// 0,1

-------------------------------------------
Batch: 1
-------------------------------------------
+-----+------+
|group|values|
+-----+------+
|0    |[1]   |
+-----+------+

// it's Complete output mode so numRowsTotal is the number of keys in the state store
// no keys were available earlier (it's just started!) and so numRowsUpdated is 0
scala> println(sq.lastProgress.stateOperators(0).prettyJson)
{
"numRowsTotal" : 1,
"numRowsUpdated" : 0,
"memoryUsedBytes" : 324
}

// publish new key and old key in a single streaming batch


// new keys
// 1,1
// updates to already-stored keys
// 0,2

-------------------------------------------
Batch: 2
-------------------------------------------
+-----+------+
|group|values|
+-----+------+
|0 |[2, 1]|
|1 |[1] |
+-----+------+

// it's Complete output mode so numRowsTotal is the number of keys in the state store
// no keys were available earlier and so numRowsUpdated is...0?!
// Think it's a BUG as it should've been 1 (for the row 0,2)
// 8/30 Sent out a question to the Spark user mailing list
scala> println(sq.lastProgress.stateOperators(0).prettyJson)
{
"numRowsTotal" : 2,
"numRowsUpdated" : 0,
"memoryUsedBytes" : 572
}

// In the end...
sq.stop


Demo: StateStoreSaveExec with Update Output Mode

Caution: FIXME Example of Update with StateStoreSaveExec (and optional watermark)
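Until that example is written, the following is a minimal sketch (not the author's demo; it assumes spark-shell with spark.implicits._ in scope, and the rate source, column names and intervals are arbitrary choices for illustration) of a streaming aggregation in Update output mode with an optional streaming watermark, so that StateStoreSaveExec only emits the groups updated in a micro-batch.

import org.apache.spark.sql.functions.window
import org.apache.spark.sql.streaming.{OutputMode, Trigger}
import scala.concurrent.duration._

// Rate source gives (timestamp, value) rows
val countsPerWindow = spark
  .readStream
  .format("rate")
  .load
  .withWatermark("timestamp", "10 seconds") // optional watermark to expire old state
  .groupBy(window($"timestamp", "5 seconds"))
  .count

// In Update output mode only the rows (groups) changed in a micro-batch are written out
val sq = countsPerWindow
  .writeStream
  .format("console")
  .option("truncate", false)
  .trigger(Trigger.ProcessingTime(10.seconds))
  .outputMode(OutputMode.Update)
  .start

// sq.explain should show a StateStoreSave operator with the Update output mode
// In the end...
// sq.stop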


Demo: Developing Custom Streaming Sink (and Monitoring SQL Queries in web UI)

The demo shows the steps to develop a custom streaming sink and use it to monitor whether and what SQL queries are executed at runtime (using web UI’s SQL tab).

Note: The main motivation was to answer the question Why does a single structured query run multiple SQL queries per batch? that turned out to be fairly surprising. You’re very welcome to upvote the question and answers at your earliest convenience. Thanks!

The steps are as follows:

1. Creating Custom Sink — DemoSink

2. Creating StreamSinkProvider — DemoSinkProvider

3. Optional Sink Registration using META-INF/services

4. build.sbt Definition

5. Packaging DemoSink

6. Using DemoSink in Streaming Query

7. Monitoring SQL Queries using web UI’s SQL Tab

Findings (aka surprises):

1. Custom sinks require that you define a checkpoint location using checkpointLocation
option (or spark.sql.streaming.checkpointLocation Spark property). Remove the
checkpoint directory (or use a different one every start of a streaming query) to have
consistent results.
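For instance, instead of the per-query checkpointLocation option you could set the session-wide property once (the path below is arbitrary):

// Session-wide default checkpoint location (used when a query does not define one)
spark.conf.set("spark.sql.streaming.checkpointLocation", "/tmp/demo-checkpoints")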

Creating Custom Sink — DemoSink


A streaming sink follows the Sink contract and a sample implementation could look as
follows.


package pl.japila.spark.sql.streaming

import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.execution.streaming.Sink
import org.apache.spark.sql.streaming.OutputMode

case class DemoSink(
    sqlContext: SQLContext,
    parameters: Map[String, String],
    partitionColumns: Seq[String],
    outputMode: OutputMode) extends Sink {

  override def addBatch(batchId: Long, data: DataFrame): Unit = {
    println(s"addBatch($batchId)")
    data.explain()
    // Why so many lines just to show the input DataFrame?
    data.sparkSession.createDataFrame(
      data.sparkSession.sparkContext.parallelize(data.collect()), data.schema)
      .show(10)
  }
}

Save the file under src/main/scala in your project.

Creating StreamSinkProvider — DemoSinkProvider

package pl.japila.spark.sql.streaming

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.execution.streaming.Sink
import org.apache.spark.sql.sources.{DataSourceRegister, StreamSinkProvider}
import org.apache.spark.sql.streaming.OutputMode

class DemoSinkProvider extends StreamSinkProvider
  with DataSourceRegister {

  override def createSink(
      sqlContext: SQLContext,
      parameters: Map[String, String],
      partitionColumns: Seq[String],
      outputMode: OutputMode): Sink = {
    DemoSink(sqlContext, parameters, partitionColumns, outputMode)
  }

  override def shortName(): String = "demo"
}

Save the file under src/main/scala in your project.

Optional Sink Registration using META-INF/services

The step is optional, but it greatly improves the experience when using the custom sink: you can use it by its name (rather than a fully-qualified class name or using a special class name for the sink provider).


Create org.apache.spark.sql.sources.DataSourceRegister in the META-INF/services directory with the following content.

pl.japila.spark.sql.streaming.DemoSinkProvider

Save the file under src/main/resources in your project.

build.sbt Definition
If you use my beloved build tool sbt to manage the project, use the following build.sbt .

organization := "pl.japila.spark"
name := "spark-structured-streaming-demo-sink"
version := "0.1"

scalaVersion := "2.11.11"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.2.0"

Packaging DemoSink
The step depends on what build tool you use to manage the project. Use whatever
command you use to create a jar file with the above classes compiled and bundled together.

$ sbt package
[info] Loading settings from plugins.sbt ...
[info] Loading project definition from /Users/jacek/dev/sandbox/spark-structured-strea
ming-demo-sink/project
[info] Loading settings from build.sbt ...
[info] Set current project to spark-structured-streaming-demo-sink (in build file:/Use
rs/jacek/dev/sandbox/spark-structured-streaming-demo-sink/)
[info] Compiling 1 Scala source to /Users/jacek/dev/sandbox/spark-structured-streaming
-demo-sink/target/scala-2.11/classes ...
[info] Done compiling.
[info] Packaging /Users/jacek/dev/sandbox/spark-structured-streaming-demo-sink/target/
scala-2.11/spark-structured-streaming-demo-sink_2.11-0.1.jar ...
[info] Done packaging.
[success] Total time: 5 s, completed Sep 12, 2017 9:34:19 AM

The jar with the sink is /Users/jacek/dev/sandbox/spark-structured-streaming-demo-sink/target/scala-2.11/spark-structured-streaming-demo-sink_2.11-0.1.jar .

Using DemoSink in Streaming Query


The following code reads data from the rate source and simply outputs the result to our
custom DemoSink .

// Make sure the DemoSink jar is available


$ ls /Users/jacek/dev/sandbox/spark-structured-streaming-demo-sink/target/scala-2.11/s
park-structured-streaming-demo-sink_2.11-0.1.jar
/Users/jacek/dev/sandbox/spark-structured-streaming-demo-sink/target/scala-2.11/spark-
structured-streaming-demo-sink_2.11-0.1.jar

// "Install" the DemoSink using --jars command-line option
$ ./bin/spark-shell --jars /Users/jacek/dev/sandbox/spark-structured-streaming-demo-sink/target/scala-2.11/spark-structured-streaming-demo-sink_2.11-0.1.jar

scala> spark.version
res0: String = 2.3.0-SNAPSHOT

import org.apache.spark.sql.streaming._
import scala.concurrent.duration._
val sq = spark.
readStream.
format("rate").
load.
writeStream.
format("demo").
option("checkpointLocation", "/tmp/demo-checkpoint").
trigger(Trigger.ProcessingTime(10.seconds)).
start

// In the end...
scala> sq.stop
17/09/12 09:59:28 INFO StreamExecution: Query [id = 03cd78e3-94e2-439c-9c12-cfed0c9968
12, runId = 6938af91-9806-4404-965a-5ae7525d5d3f] was stopped

Monitoring SQL Queries using web UI’s SQL Tab


Open http://localhost:4040/SQL/.

You should find that every trigger (aka batch) results in 3 SQL queries. Why?


Figure 1. web UI’s SQL Tab and Completed Queries (3 Queries per Batch)
The answer lies in what sources and sink a streaming query uses (and differs per streaming
query).

In our case, DemoSink collects the rows from the input DataFrame and shows it
afterwards. That gives 2 SQL queries (as you can see after executing the following batch
queries).

// batch non-streaming query


val data = (0 to 3).toDF("id")

// That gives one SQL query


data.collect

// That gives one SQL query, too


data.show

The remaining query (which is the first among the queries) is executed when you load the
data.

That can be observed easily when you change DemoSink to not "touch" the input data (in
addBatch ) in any way.

override def addBatch(batchId: Long, data: DataFrame): Unit = {


println(s"addBatch($batchId)")
}

Re-run the streaming query (using the new DemoSink ) and use web UI’s SQL tab to see the
queries. You should have just one query per batch (and no Spark jobs given nothing is really
done in the sink’s addBatch ).


Figure 2. web UI’s SQL Tab and Completed Queries (1 Query per Batch)


Demo: current_timestamp Function For Processing Time in Streaming Queries

The demo shows what happens when you use current_timestamp function in your structured queries.

Note: The main motivation was to answer the question How to achieve ingestion time? in Spark Structured Streaming. You’re very welcome to upvote the question and answers at your earliest convenience. Thanks!

Quoting the Apache Flink documentation:

Event time is the time that each individual event occurred on its producing device. This
time is typically embedded within the records before they enter Flink and that event
timestamp can be extracted from the record.

That is exactly how event time is considered in withWatermark operator which you use to
describe what column to use for event time. The column could be part of the input dataset
or…​generated.

And that is the moment where my confusion starts.

In order to generate the event time column for withWatermark operator you could use
current_timestamp or current_date standard functions.

// rate format gives event time


// but let's generate a brand new column with ours
// for demo purposes
val values = spark.
readStream.
format("rate").
load.
withColumn("current_timestamp", current_timestamp)
scala> values.printSchema
root
|-- timestamp: timestamp (nullable = true)
|-- value: long (nullable = true)
|-- current_timestamp: timestamp (nullable = false)

Both are special for Spark Structured Streaming as StreamExecution replaces their
underlying Catalyst expressions, CurrentTimestamp and CurrentDate respectively, with
CurrentBatchTimestamp expression and the time of the current batch.


import org.apache.spark.sql.streaming.Trigger
import scala.concurrent.duration._
val sq = values.
writeStream.
format("console").
option("truncate", false).
trigger(Trigger.ProcessingTime(10.seconds)).
start

// note the value of current_timestamp


// that corresponds to the batch time

-------------------------------------------
Batch: 1
-------------------------------------------
+-----------------------+-----+-------------------+
|timestamp |value|current_timestamp |
+-----------------------+-----+-------------------+
|2017-09-18 10:53:31.523|0 |2017-09-18 10:53:40|
|2017-09-18 10:53:32.523|1 |2017-09-18 10:53:40|
|2017-09-18 10:53:33.523|2 |2017-09-18 10:53:40|
|2017-09-18 10:53:34.523|3 |2017-09-18 10:53:40|
|2017-09-18 10:53:35.523|4 |2017-09-18 10:53:40|
|2017-09-18 10:53:36.523|5 |2017-09-18 10:53:40|
|2017-09-18 10:53:37.523|6 |2017-09-18 10:53:40|
|2017-09-18 10:53:38.523|7 |2017-09-18 10:53:40|
+-----------------------+-----+-------------------+

// Use web UI's SQL tab for the batch (Submitted column)
// or sq.recentProgress
scala> println(sq.recentProgress(1).timestamp)
2017-09-18T08:53:40.000Z

// Note current_batch_timestamp

scala> sq.explain(extended = true)


== Parsed Logical Plan ==
'Project [timestamp#2137, value#2138L, current_batch_timestamp(1505725650005, Timestam
pType, None) AS current_timestamp#50]
+- LogicalRDD [timestamp#2137, value#2138L], true

== Analyzed Logical Plan ==


timestamp: timestamp, value: bigint, current_timestamp: timestamp
Project [timestamp#2137, value#2138L, current_batch_timestamp(1505725650005, Timestamp
Type, Some(Europe/Berlin)) AS current_timestamp#50]
+- LogicalRDD [timestamp#2137, value#2138L], true

== Optimized Logical Plan ==


Project [timestamp#2137, value#2138L, 1505725650005000 AS current_timestamp#50]
+- LogicalRDD [timestamp#2137, value#2138L], true

== Physical Plan ==

*Project [timestamp#2137, value#2138L, 1505725650005000 AS current_timestamp#50]
+- Scan ExistingRDD[timestamp#2137,value#2138L]

That seems to be closer to processing time than ingestion time given the definition from the
Apache Flink documentation:

Processing time refers to the system time of the machine that is executing the
respective operation.

Ingestion time is the time that events enter Flink.

What do you think?


Demo: Using StreamingQueryManager for Query Termination Management

The demo shows how to use StreamingQueryManager (and specifically awaitAnyTermination and resetTerminated) for query termination management.

demo-StreamingQueryManager.scala

// Save the code as demo-StreamingQueryManager.scala


// Start it using spark-shell
// $ ./bin/spark-shell -i demo-StreamingQueryManager.scala

// Register a StreamingQueryListener to receive notifications about state changes of streaming queries
import org.apache.spark.sql.streaming.StreamingQueryListener
val myQueryListener = new StreamingQueryListener {
import org.apache.spark.sql.streaming.StreamingQueryListener._
def onQueryTerminated(event: QueryTerminatedEvent): Unit = {
println(s"Query ${event.id} terminated")
}

def onQueryStarted(event: QueryStartedEvent): Unit = {}


def onQueryProgress(event: QueryProgressEvent): Unit = {}
}
spark.streams.addListener(myQueryListener)

import org.apache.spark.sql.streaming._
import scala.concurrent.duration._

// Start streaming queries

// Start the first query


val q4s = spark.readStream.
format("rate").
load.
writeStream.
format("console").
trigger(Trigger.ProcessingTime(4.seconds)).
option("truncate", false).
start

// Start another query that is slightly slower


val q10s = spark.readStream.
format("rate").
load.
writeStream.
format("console").
trigger(Trigger.ProcessingTime(10.seconds)).
option("truncate", false).
start

// Both queries run concurrently


// You should see different outputs in the console
// q4s prints out 4 rows every batch and twice as often as q10s
// q10s prints out 10 rows every batch

/*
-------------------------------------------
Batch: 7
-------------------------------------------
+-----------------------+-----+
|timestamp |value|
+-----------------------+-----+
|2017-10-27 13:44:07.462|21 |
|2017-10-27 13:44:08.462|22 |
|2017-10-27 13:44:09.462|23 |
|2017-10-27 13:44:10.462|24 |
+-----------------------+-----+

-------------------------------------------
Batch: 8
-------------------------------------------
+-----------------------+-----+
|timestamp |value|
+-----------------------+-----+
|2017-10-27 13:44:11.462|25 |
|2017-10-27 13:44:12.462|26 |
|2017-10-27 13:44:13.462|27 |
|2017-10-27 13:44:14.462|28 |
+-----------------------+-----+

-------------------------------------------
Batch: 2
-------------------------------------------
+-----------------------+-----+
|timestamp |value|
+-----------------------+-----+
|2017-10-27 13:44:09.847|6 |
|2017-10-27 13:44:10.847|7 |
|2017-10-27 13:44:11.847|8 |
|2017-10-27 13:44:12.847|9 |
|2017-10-27 13:44:13.847|10 |
|2017-10-27 13:44:14.847|11 |
|2017-10-27 13:44:15.847|12 |
|2017-10-27 13:44:16.847|13 |
|2017-10-27 13:44:17.847|14 |
|2017-10-27 13:44:18.847|15 |
+-----------------------+-----+
*/

// Stop q4s on a separate thread


// as we're about to block the current thread awaiting query termination


import java.util.concurrent.Executors
import java.util.concurrent.TimeUnit.SECONDS
def queryTerminator(query: StreamingQuery) = new Runnable {
def run = {
println(s"Stopping streaming query: ${query.id}")
query.stop
}
}
import java.util.concurrent.TimeUnit.SECONDS
// Stop the first query after 10 seconds
Executors.newSingleThreadScheduledExecutor.
scheduleWithFixedDelay(queryTerminator(q4s), 10, 60 * 5, SECONDS)
// Stop the other query after 20 seconds
Executors.newSingleThreadScheduledExecutor.
scheduleWithFixedDelay(queryTerminator(q10s), 20, 60 * 5, SECONDS)

// Use StreamingQueryManager to wait for any query termination (either q4s or q10s)


// the current thread will block indefinitely until either streaming query has finished

spark.streams.awaitAnyTermination

// You are here only after either streaming query has finished
// Executing spark.streams.awaitAnyTermination again would return immediately

// You should have received the QueryTerminatedEvent for the query termination

// reset the last terminated streaming query


spark.streams.resetTerminated

// You know at least one query has terminated

// Wait for the other query to terminate


spark.streams.awaitAnyTermination

assert(spark.streams.active.isEmpty)

println("The demo went all fine. Exiting...")

// leave spark-shell
System.exit(0)


Streaming Aggregation

In Spark Structured Streaming, a streaming aggregation is a streaming query that was described (built) using the following high-level streaming operators:

Dataset.groupBy, Dataset.rollup, Dataset.cube (that simply create a RelationalGroupedDataset)

Dataset.groupByKey (that simply creates a KeyValueGroupedDataset)

SQL’s GROUP BY clause (including WITH CUBE and WITH ROLLUP)

Streaming aggregation belongs to the category of Stateful Stream Processing.

IncrementalExecution — QueryExecution of Streaming Queries

Under the covers, the high-level operators create a logical query plan with one or more Aggregate logical operators.

Tip: Read up on Aggregate logical operator in The Internals of Spark SQL book.

In Spark Structured Streaming IncrementalExecution is responsible for planning streaming queries for execution.

At query planning, IncrementalExecution uses the StatefulAggregationStrategy execution planning strategy for planning streaming aggregations (Aggregate unary logical operators) as pairs of StateStoreRestoreExec and StateStoreSaveExec physical operators.

// input data from a data source


// it's rate data source
// but that does not really matter
// We need a streaming Dataset
val input = spark
.readStream
.format("rate")
.load

// Streaming aggregation with groupBy


val counts = input
.groupBy($"value" % 2)
.count

counts.explain(extended = true)
/**
== Parsed Logical Plan ==


'Aggregate [('value % 2)], [('value % 2) AS (value % 2)#23, count(1) AS count#22L]


+- StreamingRelationV2 org.apache.spark.sql.execution.streaming.sources.RateStreamProv
ider@7879348, rate, [timestamp#15, value#16L]

== Analyzed Logical Plan ==


(value % 2): bigint, count: bigint
Aggregate [(value#16L % cast(2 as bigint))], [(value#16L % cast(2 as bigint)) AS (valu
e % 2)#23L, count(1) AS count#22L]
+- StreamingRelationV2 org.apache.spark.sql.execution.streaming.sources.RateStreamProv
ider@7879348, rate, [timestamp#15, value#16L]

== Optimized Logical Plan ==


Aggregate [(value#16L % 2)], [(value#16L % 2) AS (value % 2)#23L, count(1) AS count#22
L]
+- Project [value#16L]
+- StreamingRelationV2 org.apache.spark.sql.execution.streaming.sources.RateStreamP
rovider@7879348, rate, [timestamp#15, value#16L]

== Physical Plan ==
*(4) HashAggregate(keys=[(value#16L % 2)#27L], functions=[count(1)], output=[(value %
2)#23L, count#22L])
+- StateStoreSave [(value#16L % 2)#27L], state info [ checkpoint = <unknown>, runId =
8c0ae2be-5eaa-4038-bc29-a176abfaf885, opId = 0, ver = 0, numPartitions = 200], Append,
0, 2
+- *(3) HashAggregate(keys=[(value#16L % 2)#27L], functions=[merge_count(1)], outpu
t=[(value#16L % 2)#27L, count#29L])
+- StateStoreRestore [(value#16L % 2)#27L], state info [ checkpoint = <unknown>,
runId = 8c0ae2be-5eaa-4038-bc29-a176abfaf885, opId = 0, ver = 0, numPartitions = 200]
, 2
+- *(2) HashAggregate(keys=[(value#16L % 2)#27L], functions=[merge_count(1)],
output=[(value#16L % 2)#27L, count#29L])
+- Exchange hashpartitioning((value#16L % 2)#27L, 200)
+- *(1) HashAggregate(keys=[(value#16L % 2) AS (value#16L % 2)#27L], fu
nctions=[partial_count(1)], output=[(value#16L % 2)#27L, count#29L])
+- *(1) Project [value#16L]
+- StreamingRelation rate, [timestamp#15, value#16L]
*/

Demos
Use the following demos to learn more:

Streaming Watermark with Aggregation in Append Output Mode

Streaming Query for Running Counts (Socket Source and Complete Output Mode)

Streaming Aggregation with Kafka Data Source

groupByKey Streaming Aggregation in Update Mode


StateStoreRDD — RDD for Updating State (in StateStores Across Spark Cluster)

StateStoreRDD is an RDD for executing storeUpdateFunction with StateStore (and data from partitions of the data RDD).

StateStoreRDD is created for the following stateful physical operators (using

StateStoreOps.mapPartitionsWithStateStore):

FlatMapGroupsWithStateExec

StateStoreRestoreExec

StateStoreSaveExec

StreamingDeduplicateExec

StreamingGlobalLimitExec

Figure 1. StateStoreRDD, Physical and Logical Plans, and operators


StateStoreRDD uses StateStoreCoordinator for the preferred locations of a partition for job

scheduling.

Figure 2. StateStoreRDD and StateStoreCoordinator


getPartitions is exactly the partitions of the data RDD.

Computing Partition —  compute Method


compute(
partition: Partition,
ctxt: TaskContext): Iterator[U]

Note compute is part of the RDD Contract to compute a given partition.

compute computes dataRDD passing the result on to storeUpdateFunction (with a

configured StateStore).

Internally, (and similarly to getPreferredLocations) compute creates a StateStoreProviderId


with StateStoreId (using checkpointLocation, operatorId and the index of the input
partition ) and queryRunId.

compute then requests StateStore for the store for the StateStoreProviderId.

In the end, compute computes dataRDD (using the input partition and ctxt ) followed by
executing storeUpdateFunction (with the store and the result).

Placement Preferences of Partition (Preferred Locations) —  getPreferredLocations Method

getPreferredLocations(partition: Partition): Seq[String]

Note: getPreferredLocations is a part of the RDD Contract to specify placement preferences (aka preferred task locations), i.e. where tasks should be executed to be as close to the data as possible.

getPreferredLocations creates a StateStoreProviderId with StateStoreId (using

checkpointLocation, operatorId and the index of the input partition ) and queryRunId.

Note: checkpointLocation and operatorId are shared across different partitions and so the only difference in StateStoreProviderIds is the partition index.

In the end, getPreferredLocations requests StateStoreCoordinatorRef for the location of the


state store for the StateStoreProviderId.

Note: StateStoreCoordinator coordinates instances of StateStores across Spark executors in the cluster, and tracks their locations for job scheduling.

Creating StateStoreRDD Instance


StateStoreRDD takes the following to be created:

Data RDD ( RDD[T] to update the aggregates in a state store)


Store update function ( (StateStore, Iterator[T]) ⇒ Iterator[U] where T is the type of rows in the data RDD)

Checkpoint directory

Run ID of the streaming query

Operator ID

Version of the store

Key schema - schema of the keys

Value schema - schema of the values

Index

SessionState

Optional StateStoreCoordinatorRef

StateStoreRDD initializes the internal properties.

Internal Properties

Name Description
hadoopConfBroadcast

storeConf
Configuration parameters (as StateStoreConf ) using the
current SQLConf (from SessionState )


StateStoreOps — Extension Methods for Creating StateStoreRDD

StateStoreOps is a Scala implicit class of a data RDD (of type RDD[T]) to create a StateStoreRDD for the following physical operators:

FlatMapGroupsWithStateExec

StateStoreRestoreExec

StateStoreSaveExec

StreamingDeduplicateExec

Note: Implicit Classes are a language feature in Scala for implicit conversions with extension methods for existing types.

Creating StateStoreRDD (with storeUpdateFunction Aborting StateStore When Task Fails) —  mapPartitionsWithStateStore Method

mapPartitionsWithStateStore[U](
stateInfo: StatefulOperatorStateInfo,
keySchema: StructType,
valueSchema: StructType,
indexOrdinal: Option[Int],
sessionState: SessionState,
storeCoordinator: Option[StateStoreCoordinatorRef])(
storeUpdateFunction: (StateStore, Iterator[T]) => Iterator[U]): StateStoreRDD[T, U]
// Used for testing only
mapPartitionsWithStateStore[U](
sqlContext: SQLContext,
stateInfo: StatefulOperatorStateInfo,
keySchema: StructType,
valueSchema: StructType,
indexOrdinal: Option[Int])(
storeUpdateFunction: (StateStore, Iterator[T]) => Iterator[U]): StateStoreRDD[T, U] (
1)

1. Uses sqlContext.streams.stateStoreCoordinator to access StateStoreCoordinator

Internally, mapPartitionsWithStateStore requests SparkContext to clean the storeUpdateFunction function.

Note: mapPartitionsWithStateStore uses the enclosing RDD to access the current SparkContext.

Note: Function Cleaning is to clean a closure from unreferenced variables before it is serialized and sent to tasks. SparkContext reports a SparkException when the closure is not serializable.

mapPartitionsWithStateStore then creates a (wrapper) function to abort the StateStore if state updates had not been committed before a task finished (which is to make sure that the StateStore has been committed or aborted in the end to follow the contract of StateStore).

Note: mapPartitionsWithStateStore uses a TaskCompletionListener to be notified when a task has finished.
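The idea of the wrapper can be illustrated with the following minimal sketch (a simplified illustration under assumed signatures, not the actual Spark code):

import org.apache.spark.TaskContext
import org.apache.spark.sql.execution.streaming.state.StateStore

// Wrap a store update function so the StateStore is aborted
// when the task completes without the state updates being committed
def withAbortOnTaskCompletion[T, U](
    storeUpdateFunction: (StateStore, Iterator[T]) => Iterator[U]): (StateStore, Iterator[T]) => Iterator[U] = {
  (store: StateStore, iter: Iterator[T]) => {
    TaskContext.get().addTaskCompletionListener[Unit] { _ =>
      if (!store.hasCommitted) store.abort()
    }
    storeUpdateFunction(store, iter)
  }
}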

In the end, mapPartitionsWithStateStore creates a StateStoreRDD (with the wrapper function, SessionState and StateStoreCoordinatorRef).

Note: mapPartitionsWithStateStore is used when the following physical operators are executed:

FlatMapGroupsWithStateExec

StateStoreRestoreExec

StateStoreSaveExec

StreamingDeduplicateExec

StreamingGlobalLimitExec


StreamingAggregationStateManager Contract — State Managers for Streaming Aggregation

StreamingAggregationStateManager is the abstraction of state managers that act as middlemen between state stores and the physical operators used in Streaming Aggregation (e.g. StateStoreSaveExec and StateStoreRestoreExec).

Table 1. StreamingAggregationStateManager Contract (method and description)

commit(store: StateStore): Long
  Commits all updates (changes) to the given state store and returns the new version.
  Used exclusively when StateStoreSaveExec physical operator is executed.

get(store: StateStore, key: UnsafeRow): UnsafeRow
  Looks up the value of the key from the state store (the key is non-null).
  Used exclusively when StateStoreRestoreExec physical operator is executed.

getKey(row: UnsafeRow): UnsafeRow
  Extracts the columns for the key from the input row.
  Used when StateStoreRestoreExec physical operator is executed and when StreamingAggregationStateManagerImplV1 legacy state manager is requested to put a row to a state store.

getStateValueSchema: StructType
  Gets the schema of the values in a state store.
  Used when StateStoreRestoreExec and StateStoreSaveExec physical operators are executed.

iterator(store: StateStore): Iterator[UnsafeRowPair]
  Returns all UnsafeRow key-value pairs in the given state store.
  Used exclusively when StateStoreSaveExec physical operator is executed.

keys(store: StateStore): Iterator[UnsafeRow]
  Returns all the keys in the state store.
  Used exclusively when physical operators with WatermarkSupport are requested to removeKeysOlderThanWatermark (i.e. exclusively when StateStoreSaveExec physical operator is executed).

put(store: StateStore, row: UnsafeRow): Unit
  Stores (puts) the given row in the given state store.
  Used exclusively when StateStoreSaveExec physical operator is executed.

remove(store: StateStore, key: UnsafeRow): Unit
  Removes the key-value pair from the given state store per key.
  Used exclusively when StateStoreSaveExec physical operator is executed (directly or indirectly as a WatermarkSupport).

values(store: StateStore): Iterator[UnsafeRow]
  All values in the state store.
  Used exclusively when StateStoreSaveExec physical operator is executed.


StreamingAggregationStateManager supports two versions of state managers for streaming aggregations (per the spark.sql.streaming.aggregation.stateFormatVersion internal configuration property):

1 (for the legacy StreamingAggregationStateManagerImplV1)

2 (for the default StreamingAggregationStateManagerImplV2)
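A hedged sketch of pinning the legacy version before a new streaming query is started (this assumes the internal property can be set like any other SQL configuration property; the value below is for illustration only):

// Must be set before the streaming query is started
spark.conf.set("spark.sql.streaming.aggregation.stateFormatVersion", "1")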

Note: StreamingAggregationStateManagerBaseImpl is the one and only known direct implementation of the StreamingAggregationStateManager Contract in Spark Structured Streaming.

Note: StreamingAggregationStateManager is a Scala sealed trait which means that all the implementations are in the same compilation unit (a single file).

Creating StreamingAggregationStateManager Instance —  createStateManager Factory Method

createStateManager(
keyExpressions: Seq[Attribute],
inputRowAttributes: Seq[Attribute],
stateFormatVersion: Int): StreamingAggregationStateManager

createStateManager creates a new StreamingAggregationStateManager for a given

stateFormatVersion :

StreamingAggregationStateManagerImplV1 for stateFormatVersion being 1

StreamingAggregationStateManagerImplV2 for stateFormatVersion being 2

createStateManager throws an IllegalArgumentException for any other stateFormatVersion:

Version [stateFormatVersion] is invalid

Note: createStateManager is used when StateStoreRestoreExec and StateStoreSaveExec physical operators are created.


StreamingAggregationStateManagerBaseImpl — Base State Manager for Streaming Aggregation

StreamingAggregationStateManagerBaseImpl is the base implementation of the StreamingAggregationStateManager contract for state managers for streaming aggregations that use UnsafeProjection to getKey.

StreamingAggregationStateManagerBaseImpl uses UnsafeProjection to getKey.

Table 1. StreamingAggregationStateManagerBaseImpls

StreamingAggregationStateManagerImplV1
  Legacy StreamingAggregationStateManager (when the spark.sql.streaming.aggregation.stateFormatVersion configuration property is 1)

StreamingAggregationStateManagerImplV2
  Default StreamingAggregationStateManager (when the spark.sql.streaming.aggregation.stateFormatVersion configuration property is 2)

StreamingAggregationStateManagerBaseImpl takes the following to be created:

Catalyst expressions for the keys ( Seq[Attribute] )

Catalyst expressions for the input rows ( Seq[Attribute] )

Note: StreamingAggregationStateManagerBaseImpl is a Scala abstract class and cannot be created directly. It is created indirectly for the concrete StreamingAggregationStateManagerBaseImpls.

Committing (Changes to) State Store —  commit Method

commit(
store: StateStore): Long

Note: commit is part of the StreamingAggregationStateManager Contract to commit changes to a state store.

commit simply requests the state store to commit state changes.

125
StreamingAggregationStateManagerBaseImpl

Removing Key From State Store —  remove Method

remove(store: StateStore, key: UnsafeRow): Unit

Note: remove is part of the StreamingAggregationStateManager Contract to remove a key from a state store.

remove …​FIXME

getKey Method

getKey(row: UnsafeRow): UnsafeRow

Note getKey is part of the StreamingAggregationStateManager Contract to…​FIXME

getKey …​FIXME

Getting All Keys in State Store —  keys Method

keys(store: StateStore): Iterator[UnsafeRow]

Note: keys is part of the StreamingAggregationStateManager Contract to get all keys in a state store (as an iterator).

keys …​FIXME


StreamingAggregationStateManagerImplV1 — Legacy State Manager for Streaming Aggregation

StreamingAggregationStateManagerImplV1 is the legacy state manager for streaming aggregations.

Note: The version of a state manager is controlled using the spark.sql.streaming.aggregation.stateFormatVersion internal configuration property.

StreamingAggregationStateManagerImplV1 is created exclusively when

StreamingAggregationStateManager is requested for a new

StreamingAggregationStateManager.

Storing Row in State Store —  put Method

put(store: StateStore, row: UnsafeRow): Unit

Note: put is part of the StreamingAggregationStateManager Contract to store a row in a state store.

put …​FIXME

Creating StreamingAggregationStateManagerImplV1
Instance
StreamingAggregationStateManagerImplV1 takes the following when created:

Attribute expressions for keys ( Seq[Attribute] )

Attribute expressions of input rows ( Seq[Attribute] )


StreamingAggregationStateManagerImplV2 — Default State Manager for Streaming Aggregation

StreamingAggregationStateManagerImplV2 is the default state manager for streaming aggregations.

Note: The version of a state manager is controlled using the spark.sql.streaming.aggregation.stateFormatVersion internal configuration property.

StreamingAggregationStateManagerImplV2 is created exclusively when

StreamingAggregationStateManager is requested for a new

StreamingAggregationStateManager.

StreamingAggregationStateManagerImplV2 (like the parent

StreamingAggregationStateManagerBaseImpl) takes the following to be created:

Catalyst expressions for the keys ( Seq[Attribute] )

Catalyst expressions for the input rows ( Seq[Attribute] )

Storing Row in State Store —  put Method

put(store: StateStore, row: UnsafeRow): Unit

Note: put is part of the StreamingAggregationStateManager Contract to store a row in a state store.

put …​FIXME

Getting Saved State for Non-Null Key from State Store —  get Method

get(store: StateStore, key: UnsafeRow): UnsafeRow

Note: get is part of the StreamingAggregationStateManager Contract to get the saved state for a given non-null key from a given state store.

get requests the given StateStore for the current state value for the given key.


get returns null if the key could not be found in the state store. Otherwise, get

restoreOriginalRow (for the key and the saved state).

restoreOriginalRow Internal Method

restoreOriginalRow(key: UnsafeRow, value: UnsafeRow): UnsafeRow


restoreOriginalRow(rowPair: UnsafeRowPair): UnsafeRow

restoreOriginalRow …​FIXME

Note: restoreOriginalRow is used when StreamingAggregationStateManagerImplV2 is requested to get the saved state for a given non-null key from a state store, iterator and values.

getStateValueSchema Method

getStateValueSchema: StructType

getStateValueSchema is part of the StreamingAggregationStateManager


Note
Contract to…​FIXME.

getStateValueSchema simply requests the valueExpressions for the schema.

iterator Method

iterator(store: StateStore): Iterator[UnsafeRowPair]

iterator is part of the StreamingAggregationStateManager Contract to…​


Note
FIXME.

iterator simply requests the input state store for the iterator that is mapped to an iterator

of UnsafeRowPairs with the key (of the input UnsafeRowPair ) and the value as a restored
original row.

scala.collection.Iterator is a data structure that allows to iterate over a sequence


Note of elements that are usually fetched lazily (i.e. no elements are fetched from the
underlying store until processed).

values Method


values(store: StateStore): Iterator[UnsafeRow]

Note values is part of the StreamingAggregationStateManager Contract to…​FIXME.

values …​FIXME

Internal Properties
Name Description
joiner

keyValueJoinedExpressions

needToProjectToRestoreValue

restoreValueProjector

valueExpressions

valueProjector


Stateful Stream Processing


Stateful Stream Processing is stream processing with state (implicit or explicit).

In Spark Structured Streaming, a streaming query is stateful when it is one of the following (that make use of StateStores):

Streaming Aggregation

Arbitrary Stateful Streaming Aggregation

Stream-Stream Join

Streaming Deduplication

Streaming Limit

Versioned State, StateStores and StateStoreProviders


Spark Structured Streaming uses StateStores for versioned and fault-tolerant key-value
state stores.

State stores are checkpointed incrementally to avoid state loss and for increased
performance.

State stores are managed by State Store Providers with HDFSBackedStateStoreProvider


being the default and only known implementation. HDFSBackedStateStoreProvider uses
Hadoop DFS-compliant file system for state checkpointing and fault-tolerance.

State store providers manage versioned state per stateful operator (and partition it operates
on).

The lifecycle of a StateStoreProvider begins when StateStore utility (on a Spark executor)
is requested for the StateStore by provider ID and version.

Important: It is worth noticing that, since the StateStore and StateStoreProvider utilities are Scala objects, there can only be one instance of StateStore and StateStoreProvider on a single JVM. Scala objects are (sort of) singletons, which means that there will be exactly one instance of each per JVM, and that is exactly the JVM of a Spark executor. As long as the executor is up and running, state versions are cached and no Hadoop DFS is used (except for the initial load).


When requested for a StateStore, StateStore utility is given the version of a state store to
look up. The version is either the current epoch (in Continuous Stream Processing) or the
current batch ID (in Micro-Batch Stream Processing).

StateStore utility requests StateStoreProvider utility to createAndInit that creates the

StateStoreProvider implementation (based on spark.sql.streaming.stateStore.providerClass

internal configuration property) and requests it to initialize.

The initialized StateStoreProvider is cached in loadedProviders internal lookup table (for a


StateStoreId) for later lookups.

StateStoreProvider utility then requests the StateStoreProvider for the state store for a

specified version. (e.g. a HDFSBackedStateStore in case of


HDFSBackedStateStoreProvider).

An instance of StateStoreProvider is requested to do its own maintenance or close (when a


corresponding StateStore is inactive) in MaintenanceTask daemon thread that runs
periodically every spark.sql.streaming.stateStore.maintenanceInterval configuration property
(default: 60s ).
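For example, both properties could be tuned when building the SparkSession (a hedged sketch; the values are arbitrary and the provider class shown is simply the default one spelled out):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("stateful-query")
  // Run the state store maintenance task more often than the default 60s
  .config("spark.sql.streaming.stateStore.maintenanceInterval", "30s")
  // The default (and only known) state store provider
  .config("spark.sql.streaming.stateStore.providerClass",
    "org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider")
  .getOrCreate()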

IncrementalExecution — QueryExecution of Streaming Queries
Regardless of the query language (Dataset API or SQL), any structured query (incl.
streaming queries) becomes a logical query plan.

In Spark Structured Streaming it is IncrementalExecution that plans streaming queries for


execution.

While planning a streaming query for execution (aka query planning), IncrementalExecution
uses the state preparation rule. The rule fills out the following physical operators with the
execution-specific configuration (with StatefulOperatorStateInfo being the most important for
stateful stream processing):

FlatMapGroupsWithStateExec

StateStoreRestoreExec

StateStoreSaveExec (used for streaming aggregation)

StreamingDeduplicateExec

StreamingGlobalLimitExec

StreamingSymmetricHashJoinExec


Micro-Batch Stream Processing and Extra Non-Data Batch for StateStoreWriter Stateful Operators

In Micro-Batch Stream Processing (with MicroBatchExecution engine), IncrementalExecution uses the shouldRunAnotherBatch flag that allows StateStoreWriter stateful physical operators to indicate whether the last batch execution requires another non-data batch.

The following table shows the StateStoreWriters that redefine shouldRunAnotherBatch flag.

Table 1. StateStoreWriters and shouldRunAnotherBatch Flag

FlatMapGroupsWithStateExec: Based on GroupStateTimeout

StateStoreSaveExec: Based on OutputMode and event-time watermark

StreamingDeduplicateExec: Based on event-time watermark

StreamingSymmetricHashJoinExec: Based on event-time watermark

StateStoreRDD
Right after query planning, a stateful streaming query (a single micro-batch actually)
becomes an RDD with one or more StateStoreRDDs.

You can find the StateStoreRDDs of a streaming query in the RDD lineage.

scala> :type streamingQuery


org.apache.spark.sql.streaming.StreamingQuery

scala> streamingQuery.explain
== Physical Plan ==
*(4) HashAggregate(keys=[window#13-T0ms, value#3L], functions=[count(1)])
+- StateStoreSave [window#13-T0ms, value#3L], state info [ checkpoint = file:/tmp/chec
kpoint-counts/state, runId = 1dec2d81-f2d0-45b9-8f16-39ede66e13e7, opId = 0, ver = 1,
numPartitions = 1], Append, 10000, 2
+- *(3) HashAggregate(keys=[window#13-T0ms, value#3L], functions=[merge_count(1)])
+- StateStoreRestore [window#13-T0ms, value#3L], state info [ checkpoint = file:
/tmp/checkpoint-counts/state, runId = 1dec2d81-f2d0-45b9-8f16-39ede66e13e7, opId = 0,
ver = 1, numPartitions = 1], 2
+- *(2) HashAggregate(keys=[window#13-T0ms, value#3L], functions=[merge_count(
1)])
+- Exchange hashpartitioning(window#13-T0ms, value#3L, 1)
+- *(1) HashAggregate(keys=[window#13-T0ms, value#3L], functions=[parti
al_count(1)])


+- *(1) Project [named_struct(start, precisetimestampconversion(((((


CASE WHEN (cast(CEIL((cast((precisetimestampconversion(time#2-T0ms, TimestampType, Lon
gType) - 0) as double) / 5000000.0)) as double) = (cast((precisetimestampconversion(ti
me#2-T0ms, TimestampType, LongType) - 0) as double) / 5000000.0)) THEN (CEIL((cast((pr
ecisetimestampconversion(time#2-T0ms, TimestampType, LongType) - 0) as double) / 50000
00.0)) + 1) ELSE CEIL((cast((precisetimestampconversion(time#2-T0ms, TimestampType, Lo
ngType) - 0) as double) / 5000000.0)) END + 0) - 1) * 5000000) + 0), LongType, Timesta
mpType), end, precisetimestampconversion(((((CASE WHEN (cast(CEIL((cast((precisetimest
ampconversion(time#2-T0ms, TimestampType, LongType) - 0) as double) / 5000000.0)) as d
ouble) = (cast((precisetimestampconversion(time#2-T0ms, TimestampType, LongType) - 0)
as double) / 5000000.0)) THEN (CEIL((cast((precisetimestampconversion(time#2-T0ms, Tim
estampType, LongType) - 0) as double) / 5000000.0)) + 1) ELSE CEIL((cast((precisetimes
tampconversion(time#2-T0ms, TimestampType, LongType) - 0) as double) / 5000000.0)) END
+ 0) - 1) * 5000000) + 5000000), LongType, TimestampType)) AS window#13-T0ms, value#3
L]
+- *(1) Filter isnotnull(time#2-T0ms)
+- EventTimeWatermark time#2: timestamp, interval
+- LocalTableScan <empty>, [time#2, value#3L]

import org.apache.spark.sql.execution.streaming.{StreamExecution, StreamingQueryWrapper


}
val se = streamingQuery.asInstanceOf[StreamingQueryWrapper].streamingQuery

scala> :type se
org.apache.spark.sql.execution.streaming.StreamExecution

scala> :type se.lastExecution


org.apache.spark.sql.execution.streaming.IncrementalExecution

val rdd = se.lastExecution.toRdd


scala> rdd.toDebugString
res3: String =
(1) MapPartitionsRDD[39] at toRdd at <console>:40 []
| StateStoreRDD[38] at toRdd at <console>:40 [] // <-- here
| MapPartitionsRDD[37] at toRdd at <console>:40 []
| StateStoreRDD[36] at toRdd at <console>:40 [] // <-- here
| MapPartitionsRDD[35] at toRdd at <console>:40 []
| ShuffledRowRDD[17] at start at <pastie>:67 []
+-(1) MapPartitionsRDD[16] at start at <pastie>:67 []
| MapPartitionsRDD[15] at start at <pastie>:67 []
| MapPartitionsRDD[14] at start at <pastie>:67 []
| MapPartitionsRDD[13] at start at <pastie>:67 []
| ParallelCollectionRDD[12] at start at <pastie>:67 []

StateStoreCoordinator RPC Endpoint, StateStoreRDD and


Preferred Locations


Since execution of a stateful streaming query happens on Spark executors whereas planning is on the driver, Spark Structured Streaming uses the RPC environment for tracking locations of the state stores in use. That makes the tasks (of a structured query) be scheduled where the state (of a partition) is.

When planned for execution, the StateStoreRDD is first asked for the preferred locations of a partition (which happens on the driver) that are later used to compute it (on Spark executors).

Spark Structured Streaming uses the RPC environment to keep track of StateStores (their StateStoreProviders, actually) for RDD planning.

Every time StateStoreRDD is requested for the preferred locations of a partition, it


communicates with the StateStoreCoordinator RPC endpoint that knows the locations of the
required StateStores (per host and executor ID).
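
The following is a minimal sketch (not Spark's actual code) of that idea: an RDD's preferred locations for a partition are simply whatever location a coordinator reports for the partition's state store provider. The Coordinator trait and locationOf method are hypothetical stand-ins for the StateStoreCoordinatorRef.getLocation round trip.

// Hypothetical stand-in for the StateStoreCoordinatorRef on the driver.
trait Coordinator {
  // e.g. Some("executor_host-1_3") when the provider is known to be active there
  def locationOf(operatorId: Long, partitionId: Int): Option[String]
}

// Preferred locations of a partition: the executor that already hosts the state
// store provider, or no preference when the coordinator does not know it.
def preferredLocations(
    coordinator: Option[Coordinator],
    operatorId: Long,
    partitionId: Int): Seq[String] =
  coordinator.flatMap(_.locationOf(operatorId, partitionId)).toSeq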

StateStoreRDD uses StateStoreProviderId with StateStoreId to uniquely identify the state

store to use for (associate with) a stateful operator and a partition.

State Management
The state in a stateful streaming query can be implicit or explicit.


Streaming Watermark
Streaming Watermark of a stateful streaming query specifies how long to wait for late and possibly out-of-order events before a streaming state can be considered final and no longer subject to change. Streaming watermark is used to mark events (modeled as rows in the streaming Dataset) that are older than the threshold as "too late" and not "interesting" enough to update partial, non-final streaming state.

In Spark Structured Streaming, streaming watermark is defined using


Dataset.withWatermark high-level operator.

withWatermark(
eventTime: String,
delayThreshold: String): Dataset[T]

In Dataset.withWatermark operator, eventTime is the name of the column to use to monitor


event time whereas delayThreshold is a delay threshold.
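
A minimal example of the operator in use (the rate source is used only to get a streaming Dataset with a timestamp column; column and session names are arbitrary):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window

val spark = SparkSession.builder.master("local[*]").getOrCreate()
import spark.implicits._

// A streaming Dataset with an event-time column
val events = spark.readStream
  .format("rate")
  .option("rowsPerSecond", 10)
  .load()
  .withColumnRenamed("timestamp", "eventTime")

// Accept events up to 10 minutes late and count them per 5-minute window
val counts = events
  .withWatermark("eventTime", "10 minutes")
  .groupBy(window($"eventTime", "5 minutes"))
  .count()

// Append output mode requires the watermark above for the streaming aggregation
val query = counts.writeStream
  .outputMode("append")
  .format("console")
  .start()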

Watermark Delay says how late and possibly out-of-order events are still acceptable and
contribute to the final result of a stateful streaming query. Event-time watermark delay is
used to calculate the difference between the event time of an event and the time in the past.

Event-Time Watermark is then a time threshold (point in time) that is the minimum
acceptable time of an event (modeled as a row in the streaming Dataset) that is accepted in
a stateful streaming query.

With streaming watermark, memory usage of a streaming state can be controlled since late events can simply be dropped and old state (e.g. aggregates or joins) that is never going to be updated can be removed. That avoids unbounded streaming state that would inevitably use up all the available memory of long-running streaming queries and end in out-of-memory errors.

In Append output mode the current event-time streaming watermark is used for the
following:

Output saved state rows that became expired (Expired events in the demo)

Dropping late events, i.e. don’t save them to a state store or include in aggregation
(Late events in the demo)

Streaming watermark is required for a streaming aggregation in append output mode.


Streaming Aggregation
In streaming aggregation, a streaming watermark has to be defined on one or many
grouping expressions of a streaming aggregation (directly or using window standard
function).

Note Dataset.withWatermark operator has to be used before an aggregation operator (for the watermark to have an effect).

Streaming Join
In streaming join, a streaming watermark can be defined on join keys or any of the join
sides.

Demos
Use the following demos to learn more:

Demo: Streaming Watermark with Aggregation in Append Output Mode

Internals
Under the covers, Dataset.withWatermark high-level operator creates a logical query plan
with EventTimeWatermark logical operator.

EventTimeWatermark logical operator is planned to EventTimeWatermarkExec physical

operator that extracts the event times (from the data being processed) and adds them to an
accumulator.

Since the execution (data processing) happens on Spark executors, using the accumulator is the only Spark-approved way for communication between the tasks (on the executors) and the driver. Through accumulator updates, the driver learns the current event-time watermark.

During the query planning phase (in MicroBatchExecution and ContinuousExecution) that
also happens on the driver, IncrementalExecution is given the current OffsetSeqMetadata
with the current event-time watermark.

Further Reading Or Watching


SPARK-18124 Observed delay based event time watermarks


Streaming Deduplication
Streaming Deduplication is…​FIXME


Streaming Limit
Streaming Limit is…​FIXME


StateStore Contract — Key-Value Store for Streaming State Data
StateStore is the abstraction of key-value stores for managing state in Stateful Stream

Processing (e.g. for persisting running aggregates in Streaming Aggregation).

StateStore supports incremental checkpointing in which only the key-value "Row" pairs

that changed are committed or aborted (without touching other key-value pairs).

StateStore is identified with the aggregating operator id and the partition id (among other

properties for identification).

Note HDFSBackedStateStore is the default and only known implementation of the StateStore Contract in Spark Structured Streaming.

Table 1. StateStore Contract

abort

abort(): Unit

Aborts (discards) changes to the state store

Used when:
StateStoreOps implicit class is requested to mapPartitionsWithStateStore (when the state store has not been committed for a task that finishes, possibly with an error)
StateStoreHandler (of SymmetricHashJoinStateManager) is requested to abortIfNeeded (when the state store has not been committed for a task that finishes, possibly with an error)

commit

commit(): Long

Commits the changes to the state store (and returns the current version)

Used when:
FlatMapGroupsWithStateExec, StreamingDeduplicateExec and StreamingGlobalLimitExec physical operators are executed (right after all rows in a partition have been processed)
StreamingAggregationStateManagerBaseImpl is requested to commit (changes to) a state store (exclusively when StateStoreSaveExec physical operator is executed)
StateStoreHandler (of SymmetricHashJoinStateManager) is requested to commit changes to a state store

get

get(key: UnsafeRow): UnsafeRow

Looks up (gets) the value of the given non-null key

Used when:
StreamingDeduplicateExec and StreamingGlobalLimitExec physical operators are executed
StateManagerImplBase (of FlatMapGroupsWithStateExecHelper) is requested to getState
StreamingAggregationStateManagerImplV1 and StreamingAggregationStateManagerImplV2 are requested to get the value of a non-null key
KeyToNumValuesStore is requested to get
KeyWithIndexToValueStore is requested to get and getAll

getRange

getRange(
  start: Option[UnsafeRow],
  end: Option[UnsafeRow]): Iterator[UnsafeRowPair]

Gets the key-value pairs of UnsafeRows for the specified range (with optional approximate start and end extents)

Used when:
WatermarkSupport is requested to removeKeysOlderThanWatermark
StateManagerImplBase is requested to getAllState
StreamingAggregationStateManagerBaseImpl is requested to keys
KeyToNumValuesStore and KeyWithIndexToValueStore are requested to iterator

Note All the uses above assume the start and end as None, which is basically the same as iterator.

hasCommitted

hasCommitted: Boolean

Flag to indicate whether state changes have been committed ( true ) or not ( false )

Used when:
RDD (via StateStoreOps implicit class) is requested to mapPartitionsWithStateStore (and a task finishes and may need to abort state updates)
SymmetricHashJoinStateManager is requested to abortIfNeeded (when a task finishes and may need to abort state updates)

id

id: StateStoreId

The ID of the state store

Used when:
HDFSBackedStateStore state store is requested for the textual representation
StateStoreHandler (of SymmetricHashJoinStateManager) is requested to abortIfNeeded and getStateStore

iterator

iterator(): Iterator[UnsafeRowPair]

Returns an iterator with all the key-value pairs in the state store

Used when:
StateStoreRestoreExec physical operator is requested to execute
HDFSBackedStateStore state store in particular (and any StateStore in general) is requested to getRange
StreamingAggregationStateManagerImplV1 state manager is requested for the iterator and values
StreamingAggregationStateManagerImplV2 state manager is requested to iterator and values

metrics

metrics: StateStoreMetrics

StateStoreMetrics of the state store

Used when:
StateStoreWriter stateful physical operator is requested to setStoreMetrics
StateStoreHandler (of SymmetricHashJoinStateManager) is requested to commit and for the metrics

put

put(
  key: UnsafeRow,
  value: UnsafeRow): Unit

Stores (puts) the value for the (non-null) key

Used when:
StreamingDeduplicateExec and StreamingGlobalLimitExec physical operators are executed
StateManagerImplBase is requested to putState
StreamingAggregationStateManagerImplV1 and StreamingAggregationStateManagerImplV2 are requested to store a row in a state store
KeyToNumValuesStore and KeyWithIndexToValueStore are requested to store a new value for a given key

remove

remove(key: UnsafeRow): Unit

Removes the (non-null) key from the state store

Used when:
Physical operators with WatermarkSupport are requested to removeKeysOlderThanWatermark
StateManagerImplBase is requested to removeState
StreamingAggregationStateManagerBaseImpl is requested to remove a key from a state store
KeyToNumValuesStore is requested to remove a key
KeyWithIndexToValueStore is requested to remove a key and removeAllValues

version

version: Long

Version of the state store

Used exclusively when HDFSBackedStateStore state store is requested for a new version (that is simply the current version incremented)

Note StateStore was introduced in [SPARK-13809][SQL] State store for streaming aggregations. Read the motivation and design in State Store for Streaming Aggregations.

Tip Enable ALL logging level for org.apache.spark.sql.execution.streaming.state.StateStore$ logger to see what happens inside.

Add the following line to conf/log4j.properties:

log4j.logger.org.apache.spark.sql.execution.streaming.state.StateStore$=ALL

Refer to Logging.

Creating (and Caching) RPC Endpoint Reference to


StateStoreCoordinator for Executors —  coordinatorRef
Internal Object Method

coordinatorRef: Option[StateStoreCoordinatorRef]

coordinatorRef requests the SparkEnv helper object for the current SparkEnv .

If the SparkEnv is available and the _coordRef is not assigned yet, coordinatorRef prints
out the following DEBUG message to the logs followed by requesting the
StateStoreCoordinatorRef for the StateStoreCoordinator endpoint.

Getting StateStoreCoordinatorRef

If the SparkEnv is available, coordinatorRef prints out the following INFO message to the
logs:

Retrieved reference to StateStoreCoordinator: [_coordRef]

Note coordinatorRef is used when the StateStore helper object is requested to reportActiveStoreInstance (when the StateStore helper object is requested to find the StateStore by StateStoreProviderId) and verifyIfStoreInstanceActive (when the StateStore helper object is requested to doMaintenance).


Unloading State Store Provider —  unload Method

unload(storeProviderId: StateStoreProviderId): Unit

unload …​FIXME

Note unload is used when StateStore helper object is requested to stop and doMaintenance.

stop Object Method

stop(): Unit

stop …​FIXME

Note stop seems to be used in tests only.

Announcing New StateStoreProvider 


—  reportActiveStoreInstance Internal Object
Method

reportActiveStoreInstance(
storeProviderId: StateStoreProviderId): Unit

reportActiveStoreInstance takes the current host and executorId (from the BlockManager

on the Spark executor) and requests the StateStoreCoordinatorRef to reportActiveInstance.

Note reportActiveStoreInstance uses SparkEnv to access the BlockManager .

In the end, reportActiveStoreInstance prints out the following INFO message to the logs:

Reported that the loaded instance [storeProviderId] is active

Note reportActiveStoreInstance is used exclusively when StateStore utility is requested to find the StateStore by StateStoreProviderId.

MaintenanceTask Daemon Thread


MaintenanceTask is a daemon thread that triggers maintenance work of registered

StateStoreProviders.


When an error occurs, MaintenanceTask clears loadedProviders internal registry.

MaintenanceTask is scheduled on state-store-maintenance-task thread pool that runs

periodically every spark.sql.streaming.stateStore.maintenanceInterval (default: 60s ).

Looking Up StateStore by Provider ID —  get Object


Method

get(
storeProviderId: StateStoreProviderId,
keySchema: StructType,
valueSchema: StructType,
indexOrdinal: Option[Int],
version: Long,
storeConf: StateStoreConf,
hadoopConf: Configuration): StateStore

get finds StateStore for the specified StateStoreProviderId and version.

Note The version is either the current epoch (in Continuous Stream Processing) or the current batch ID (in Micro-Batch Stream Processing).

Internally, get looks up the StateStoreProvider (by storeProviderId ) in the


loadedProviders internal cache. If unavailable, get uses the StateStoreProvider utility to
create and initialize one.

get will also start the periodic maintenance task (unless already started) and announce the

new StateStoreProvider.

In the end, get requests the StateStoreProvider to look up the StateStore by the specified
version.

Note get is used when:

StateStoreRDD is requested to compute a partition

StateStoreHandler (of SymmetricHashJoinStateManager) is requested to look up a StateStore (by key and value schemas)
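
A minimal sketch (with simplified, hypothetical types) of the lookup described above: reuse a cached provider if one exists for the provider ID, create and initialize one otherwise, and then ask it for the requested version of the store.

import scala.collection.concurrent.TrieMap

// Hypothetical stand-in for a StateStoreProvider.
trait Provider { def getStore(version: Long): AnyRef }

// providerId -> provider (stand-in for the loadedProviders internal cache)
val loadedProviders = TrieMap.empty[String, Provider]

def getStore(providerId: String, version: Long)(newProvider: => Provider): AnyRef = {
  val provider = loadedProviders.getOrElseUpdate(providerId, newProvider)
  // In Spark this is also where the periodic maintenance task is started
  // (unless already running) and the new provider is announced (reported active).
  provider.getStore(version)
}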

Starting Periodic Maintenance Task (Unless Already


Started) —  startMaintenanceIfNeeded Internal Object
Method

startMaintenanceIfNeeded(): Unit


startMaintenanceIfNeeded schedules MaintenanceTask to start after and every

spark.sql.streaming.stateStore.maintenanceInterval (defaults to 60s ).

Note startMaintenanceIfNeeded does nothing when the maintenance task has already been started and is still running.

Note startMaintenanceIfNeeded is used exclusively when StateStore is requested to find the StateStore by StateStoreProviderId.

Doing State Maintenance of Registered State Store


Providers —  doMaintenance Internal Object Method

doMaintenance(): Unit

Internally, doMaintenance prints the following DEBUG message to the logs:

Doing maintenance

doMaintenance then requests every StateStoreProvider (registered in loadedProviders) to

do its own internal maintenance (only when a StateStoreProvider is still active).

When a StateStoreProvider is inactive, doMaintenance removes it from the provider registry


and prints the following INFO message to the logs:

Unloaded [provider]

Note doMaintenance is used exclusively in MaintenanceTask daemon thread.

verifyIfStoreInstanceActive Internal Object Method

verifyIfStoreInstanceActive(storeProviderId: StateStoreProviderId): Boolean

verifyIfStoreInstanceActive …​FIXME

Note verifyIfStoreInstanceActive is used exclusively when StateStore helper object is requested to doMaintenance (from a running MaintenanceTask daemon thread).

Internal Properties


loadedProviders

Loaded providers internal cache, i.e. StateStoreProviders per StateStoreProviderId

Used in…FIXME

_coordRef

StateStoreCoordinator RPC endpoint (a RpcEndpointRef to StateStoreCoordinator)

Used in…FIXME


StateStoreId — Unique Identifier of State Store


StateStoreId is a unique identifier of a state store with the following attributes:

Checkpoint Root Location - the root directory for state checkpointing

Operator ID - a unique ID of the stateful operator

Partition ID - the index of the partition

Store Name - the name of the state store (default: default)

StateStoreId is created when:

StateStoreRDD is requested for the preferred locations of a partition (executed on the

driver) and to compute it (later on an executor)

StateStoreProviderId helper object is requested to create a StateStoreProviderId (with

a StateStoreId and the run ID of a streaming query) that is then used for the preferred
locations of a partition of a StateStoreAwareZipPartitionsRDD (executed on the driver)
and to…​FIXME

The name of the default state store (for reading state store data that was generated before
store names were used, i.e. in Spark 2.2 and earlier) is default.

State Checkpoint Base Directory of Stateful Operator 


—  storeCheckpointLocation Method

storeCheckpointLocation(): Path

storeCheckpointLocation is Hadoop DFS’s Path of the checkpoint location (for the stateful

operator by operator ID, the partition by the partition ID in the checkpoint root location).

If the default store name is used (for Spark 2.2 and earlier), the storeName is not included in
the path.

Note storeCheckpointLocation is used exclusively when HDFSBackedStateStoreProvider is requested for the state checkpoint base directory.
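
A minimal sketch of the resulting directory layout, using the attributes listed above (checkpoint root location, operator ID, partition ID, store name):

import org.apache.hadoop.fs.Path

// <checkpointRootLocation>/<operatorId>/<partitionId>[/<storeName>],
// with the store name omitted for the default store (Spark 2.2 compatibility).
def storeCheckpointLocation(
    checkpointRootLocation: String,
    operatorId: Long,
    partitionId: Int,
    storeName: String): Path = {
  val base = new Path(checkpointRootLocation, s"$operatorId/$partitionId")
  if (storeName == "default") base else new Path(base, storeName)
}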


HDFSBackedStateStore — State Store on
HDFS-Compatible File System
HDFSBackedStateStore is a concrete StateStore that uses a Hadoop DFS-compatible file

system for versioned state persistence.

HDFSBackedStateStore is created exclusively when HDFSBackedStateStoreProvider is

requested for the specified version of state (store) for update (when StateStore utility is
requested to look up a StateStore by provider id).

HDFSBackedStateStore uses the StateStoreId of the owning

HDFSBackedStateStoreProvider.

When requested for the textual representation, HDFSBackedStateStore gives


HDFSStateStore[id=(op=[operatorId],part=[partitionId]),dir=[baseDir]].

Tip HDFSBackedStateStore is an internal class of HDFSBackedStateStoreProvider and uses its logger.

Creating HDFSBackedStateStore Instance


HDFSBackedStateStore takes the following to be created:

Version

State Map ( ConcurrentHashMap[UnsafeRow, UnsafeRow] )

HDFSBackedStateStore initializes the internal properties.

Internal State —  state Internal Property

state: STATE

state is the current state of HDFSBackedStateStore and can be in one of the three possible

states: ABORTED, COMMITTED, and UPDATING.

State changes (to the internal mapToUpdate registry) are allowed as long as
HDFSBackedStateStore is in the default UPDATING state. Right after a HDFSBackedStateStore

transitions to either COMMITTED or ABORTED state, no further state changes are allowed.


Note Don’t get confused with the term "state" as there are two states: the internal state of HDFSBackedStateStore and the state of a streaming query (that HDFSBackedStateStore is responsible for).

Table 1. Internal States

ABORTED

After abort

COMMITTED

After commit

The hasCommitted flag indicates whether HDFSBackedStateStore is in this state or not.

UPDATING

(default) Initial state after the HDFSBackedStateStore was created

Allows for state changes (e.g. put, remove, getRange) and eventually committing or aborting them

writeUpdateToDeltaFile Internal Method

writeUpdateToDeltaFile(
output: DataOutputStream,
key: UnsafeRow,
value: UnsafeRow): Unit

Caution FIXME

put Method

put(
key: UnsafeRow,
value: UnsafeRow): Unit

Note put is a part of StateStore Contract to…​FIXME

put stores the copies of the key and value in mapToUpdate internal registry followed by

writing them to a delta file (using tempDeltaFileStream).

put reports an IllegalStateException when HDFSBackedStateStore is not in UPDATING

state:


Cannot put after already committed or aborted

Committing State Changes —  commit Method

commit(): Long

Note commit is part of the StateStore Contract to commit state changes.

commit requests the parent HDFSBackedStateStoreProvider to commit state changes (as a

new version of state) (with the newVersion, the mapToUpdate and the compressed stream).

commit transitions HDFSBackedStateStore to COMMITTED state.

commit prints out the following INFO message to the logs:

Committed version [newVersion] for [this] to file [finalDeltaFile]

commit returns a newVersion.

commit throws an IllegalStateException when HDFSBackedStateStore is not in UPDATING

state:

Cannot commit after already committed or aborted

commit throws an IllegalStateException for any NonFatal exception:

Error committing version [newVersion] into [this]

Aborting State Changes —  abort Method

abort(): Unit

Note abort is part of the StateStore Contract to abort the state changes.

abort …​FIXME

Performance Metrics —  metrics Method

metrics: StateStoreMetrics


Note metrics is part of the StateStore Contract to get the StateStoreMetrics.

metrics requests the performance metrics of the parent HDFSBackedStateStoreProvider .

The performance metrics of the provider used are only the ones listed in
supportedCustomMetrics.

In the end, metrics returns a new StateStoreMetrics with the following:

Total number of keys as the size of mapToUpdate

Memory used (in bytes) as the memoryUsedBytes metric (of the parent provider)

StateStoreCustomMetrics as the supportedCustomMetrics and the


metricStateOnCurrentVersionSizeBytes metric of the parent provider

Are State Changes Committed? —  hasCommitted


Method

hasCommitted: Boolean

Note hasCommitted is part of the StateStore Contract to indicate whether state changes have been committed or not.

hasCommitted returns true when HDFSBackedStateStore is in COMMITTED state and

false otherwise.

Internal Properties


compressedStream

compressedStream: DataOutputStream

The compressed java.io.DataOutputStream for the deltaFileStream

deltaFileStream

deltaFileStream: CheckpointFileManager.CancellableFSDataOutputStream

finalDeltaFile

finalDeltaFile: Path

The Hadoop Path of the deltaFile for the version

newVersion

newVersion: Long

Used exclusively when HDFSBackedStateStore is requested for the finalDeltaFile, to commit and abort

StateStoreProvider Contract — State Store


Providers
StateStoreProvider is the abstraction of state store providers that manage state stores in

Stateful Stream Processing (e.g. for persisting running aggregates in Streaming


Aggregation) in stateful streaming queries.

Note StateStoreProvider utility uses the spark.sql.streaming.stateStore.providerClass internal configuration property for the name of the class of the default StateStoreProvider implementation.

Note HDFSBackedStateStoreProvider is the default and only known StateStoreProvider in Spark Structured Streaming.

Table 1. StateStoreProvider Contract

close

close(): Unit

Closes the state store provider

Used exclusively when StateStore helper object is requested to unload a state store provider

doMaintenance

doMaintenance(): Unit = {}

Optional state maintenance

Used exclusively when StateStore utility is requested to perform maintenance of registered state store providers (on a separate MaintenanceTask daemon thread)

getStore

getStore(
  version: Long): StateStore

Finds the StateStore for the specified version

Used exclusively when StateStore utility is requested to look up the StateStore by a given provider ID

init

init(
  stateStoreId: StateStoreId,
  keySchema: StructType,
  valueSchema: StructType,
  keyIndexOrdinal: Option[Int],
  storeConfs: StateStoreConf,
  hadoopConf: Configuration): Unit

Initializes the state store provider

Used exclusively when StateStoreProvider helper object is requested to create and initialize the StateStoreProvider for a given StateStoreId (when StateStore helper object is requested to retrieve a StateStore by ID and version)

stateStoreId

stateStoreId: StateStoreId

StateStoreId associated with the provider (at initialization)

Used when:
HDFSBackedStateStore is requested for the unique id
HDFSBackedStateStoreProvider is created and requested for the textual representation

supportedCustomMetrics

supportedCustomMetrics: Seq[StateStoreCustomMetric]

StateStoreCustomMetrics of the state store provider

Used when:
StateStoreWriter stateful physical operators are requested for the stateStoreCustomMetrics (when requested for the metrics and getProgress)
HDFSBackedStateStore is requested for the metrics
Creating and Initializing StateStoreProvider 


—  createAndInit Object Method

createAndInit(
stateStoreId: StateStoreId,
keySchema: StructType,
valueSchema: StructType,
indexOrdinal: Option[Int],
storeConf: StateStoreConf,
hadoopConf: Configuration): StateStoreProvider


createAndInit creates a new StateStoreProvider (per

spark.sql.streaming.stateStore.providerClass internal configuration property).

createAndInit requests the StateStoreProvider to initialize.

Note createAndInit is used exclusively when StateStore utility is requested for the StateStore by a given provider ID and version.
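
A minimal sketch (not Spark's actual code) of that creation step, with a simplified, hypothetical provider trait: the configured class name is instantiated reflectively via its no-argument constructor and then initialized.

// Hypothetical, simplified provider contract used only for this sketch.
trait SimpleProvider { def init(providerId: String): Unit }

def createAndInit(providerClassName: String, providerId: String): SimpleProvider = {
  // instantiate the configured class (no-argument constructor) ...
  val provider = Class.forName(providerClassName)
    .getDeclaredConstructor()
    .newInstance()
    .asInstanceOf[SimpleProvider]
  // ... and initialize it before handing it out
  provider.init(providerId)
  provider
}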


StateStoreProviderId — Unique Identifier of
State Store Provider
StateStoreProviderId is a unique identifier of a state store provider with the following

properties:

StateStoreId

Run ID of a streaming query (java.util.UUID)

In other words, StateStoreProviderId is a StateStoreId with a run ID that is different at every restart of a streaming query.

StateStoreProviderId is used by the following execution components:

StateStoreCoordinator to track the executors of state store providers (on the driver)

StateStore object to manage state store providers (on executors)

StateStoreProviderId is created (directly or using apply factory method) when:

StateStoreRDD is requested for the placement preferences of a partition and to

compute a partition

StateStoreAwareZipPartitionsRDD is requested for the preferred locations of a partition

StateStoreHandler is requested to look up a state store

Creating StateStoreProviderId —  apply Factory Method

apply(
stateInfo: StatefulOperatorStateInfo,
partitionIndex: Int,
storeName: String): StateStoreProviderId

apply simply creates a new StateStoreProviderId for the StatefulOperatorStateInfo, the

partition and the store name.

Internally, apply requests the StatefulOperatorStateInfo for the checkpoint directory (aka
checkpointLocation) and the stateful operator ID and creates a new StateStoreId (with the
partitionIndex and storeName ).

In the end, apply requests the StatefulOperatorStateInfo for the run ID of a streaming
query and creates a new StateStoreProviderId (together with the run ID).


Note apply is used when:

StateStoreAwareZipPartitionsRDD is requested for the preferred locations of a partition

StateStoreHandler is requested to look up a state store


HDFSBackedStateStoreProvider — Hadoop
DFS-based StateStoreProvider
HDFSBackedStateStoreProvider is a StateStoreProvider that uses a Hadoop DFS-compatible

file system for versioned state checkpointing.

HDFSBackedStateStoreProvider is the default StateStoreProvider per the

spark.sql.streaming.stateStore.providerClass internal configuration property.

HDFSBackedStateStoreProvider is created and immediately requested to initialize when

StateStoreProvider utility is requested to create and initialize a StateStoreProvider. That is

when HDFSBackedStateStoreProvider is given the StateStoreId that uniquely identifies the


state store to use for a stateful operator and a partition.

HDFSBackedStateStoreProvider uses HDFSBackedStateStores to manage state (one per version).

HDFSBackedStateStoreProvider manages versioned state in delta and snapshot files (and

uses a cache internally for faster access to state versions).

HDFSBackedStateStoreProvider takes no arguments to be created.

Tip Enable ALL logging level for org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider logger to see what happens inside.

Add the following line to conf/log4j.properties:

log4j.logger.org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider=ALL

Refer to Logging.

Performance Metrics


memoryUsedBytes

Estimated size of the loadedMaps internal registry

count of cache hit on states cache in provider

The number of times loading the specified version of state was successful and found (hit) the requested state version in the loadedMaps internal cache

count of cache miss on states cache in provider

The number of times loading the specified version of state could not find (missed) the requested state version in the loadedMaps internal cache

estimated size of state only on current version

Estimated size of the current state (of the HDFSBackedStateStore)

State Checkpoint Base Directory —  baseDir Lazy


Internal Property

baseDir: Path

baseDir is the base directory (as Hadoop DFS’s Path) for state checkpointing (for delta and

snapshot state files).

baseDir is initialized lazily since it is not yet known when HDFSBackedStateStoreProvider is

created.

baseDir is initialized and created based on the state checkpoint base directory of the

StateStoreId when HDFSBackedStateStoreProvider is requested to initialize.

StateStoreId — Unique Identifier of State Store


As a StateStoreProvider, HDFSBackedStateStoreProvider is associated with a StateStoreId
(which is a unique identifier of the state store for a stateful operator and a partition).

HDFSBackedStateStoreProvider is given the StateStoreId at initialization (as requested by the

StateStoreProvider contract).

The StateStoreId is then used for the following:

HDFSBackedStateStore is requested for the id

HDFSBackedStateStoreProvider is requested for the textual representation and the state

checkpoint base directory


Textual Representation —  toString Method

toString: String

Note toString is part of the java.lang.Object contract for the string representation of the object.

HDFSBackedStateStoreProvider uses the StateStoreId and the state checkpoint base

directory for the textual representation:

HDFSStateStoreProvider[id = (op=[operatorId],part=[partitionId]),dir = [baseDir]]

Loading Specified Version of State (Store) For Update 


—  getStore Method

getStore(
version: Long): StateStore

Note getStore is part of the StateStoreProvider Contract for the StateStore for a specified version.

getStore creates a new empty state ( ConcurrentHashMap[UnsafeRow, UnsafeRow] ) and loads

the specified version of state (from internal cache or snapshot and delta files) for versions
greater than 0 .

In the end, getStore creates a new HDFSBackedStateStore for the specified version with
the new state and prints out the following INFO message to the logs:

Retrieved version [version] of [this] for update

getStore throws an IllegalArgumentException when the specified version is less than 0

(negative):

Version cannot be less than 0

deltaFile Internal Method

deltaFile(version: Long): Path


deltaFile simply returns the Hadoop Path of the [version].delta file in the state

checkpoint base directory.

Note deltaFile is used when:

HDFSBackedStateStore is created (and creates the final delta file)

HDFSBackedStateStoreProvider is requested to updateFromDeltaFile

snapshotFile Internal Method

snapshotFile(version: Long): Path

snapshotFile simply returns the Hadoop Path of the [version].snapshot file in the state

checkpoint base directory.

Note snapshotFile is used when HDFSBackedStateStoreProvider is requested to writeSnapshotFile or readSnapshotFile.
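
Both naming schemes are simple enough to sketch in a couple of lines (using Hadoop's Path, with baseDir standing for the state checkpoint base directory):

import org.apache.hadoop.fs.Path

// version 42 maps to 42.delta and 42.snapshot under the state checkpoint base directory
def deltaFile(baseDir: Path, version: Long): Path = new Path(baseDir, s"$version.delta")
def snapshotFile(baseDir: Path, version: Long): Path = new Path(baseDir, s"$version.snapshot")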

Listing All Delta And Snapshot Files In State Checkpoint


Directory —  fetchFiles Internal Method

fetchFiles(): Seq[StoreFile]

fetchFiles requests the CheckpointFileManager for all the files in the state checkpoint

directory.

For every file, fetchFiles splits the name into two parts with . (dot) as a separator (files
with more or less than two parts are simply ignored) and registers a new StoreFile for
snapshot and delta files:

For snapshot files, fetchFiles creates a new StoreFile with isSnapshot flag on
( true )

For delta files, fetchFiles creates a new StoreFile with isSnapshot flag off
( false )

Note delta files are only registered if there was no snapshot file for the version.

fetchFiles prints out the following WARN message to the logs for any other files:

Could not identify file [path] for [this]


In the end, fetchFiles sorts the StoreFiles based on their version, prints out the following
DEBUG message to the logs, and returns the files.

Current set of files for [this]: [storeFiles]

Note fetchFiles is used when HDFSBackedStateStoreProvider is requested to doSnapshot and cleanup.
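
A minimal sketch of the name-based classification (with a simplified StoreFile case class standing in for Spark's internal one; non-numeric version prefixes are not handled here):

case class StoreFile(version: Long, path: String, isSnapshot: Boolean)

// Split a file name on "." and classify it; anything else is ignored
// (Spark logs the "Could not identify file ..." warning for those).
def classify(fileName: String, path: String): Option[StoreFile] =
  fileName.split("\\.") match {
    case Array(version, "delta")    => Some(StoreFile(version.toLong, path, isSnapshot = false))
    case Array(version, "snapshot") => Some(StoreFile(version.toLong, path, isSnapshot = true))
    case _                          => None
  }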

Initializing StateStoreProvider —  init Method

init(
stateStoreId: StateStoreId,
keySchema: StructType,
valueSchema: StructType,
indexOrdinal: Option[Int],
storeConf: StateStoreConf,
hadoopConf: Configuration): Unit

Note init is part of the StateStoreProvider Contract to initialize itself.

init records the values of the input arguments as the stateStoreId, keySchema,

valueSchema, storeConf, and hadoopConf internal properties.

init requests the given StateStoreConf for the

spark.sql.streaming.maxBatchesToRetainInMemory configuration property (that is then


recorded in the numberOfVersionsToRetainInMemory internal property).

In the end, init requests the CheckpointFileManager to create the baseDir directory (with
parent directories).

Finding Snapshot File and Delta Files For Version 


—  filesForVersion Internal Method

filesForVersion(
allFiles: Seq[StoreFile],
version: Long): Seq[StoreFile]

filesForVersion finds the latest snapshot version among the given allFiles files up to

and including the given version (it may or may not be available).


If a snapshot file was found (among the given file up to and including the given version),
filesForVersion takes all delta files between the version of the snapshot file (exclusive)

and the given version (inclusive) from the given allFiles files.

Note The number of delta files should be the given version minus the snapshot version.

If a snapshot file was not found, filesForVersion takes all delta files up to the given version
(inclusive) from the given allFiles files.

In the end, filesForVersion returns a snapshot version (if available) and all delta files up to
the given version (inclusive).

Note filesForVersion is used when HDFSBackedStateStoreProvider is requested to doSnapshot and cleanup.
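
A minimal sketch of the selection logic described above (with the same simplified StoreFile(version, path, isSnapshot) case class as in the previous sketch, repeated here for self-containment):

case class StoreFile(version: Long, path: String, isSnapshot: Boolean)

def filesForVersion(allFiles: Seq[StoreFile], version: Long): Seq[StoreFile] = {
  // the latest snapshot at or before the requested version, if any
  val latestSnapshot = allFiles
    .filter(f => f.isSnapshot && f.version <= version)
    .sortBy(_.version)
    .lastOption
  // deltas strictly after that snapshot (or all of them when there is no snapshot)
  // up to and including the requested version
  val deltaFloor = latestSnapshot.map(_.version).getOrElse(0L)
  val deltas = allFiles
    .filter(f => !f.isSnapshot && f.version > deltaFloor && f.version <= version)
    .sortBy(_.version)
  latestSnapshot.toSeq ++ deltas
}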

State Maintenance (Snapshotting and Cleaning Up) 


—  doMaintenance Method

doMaintenance(): Unit

Note doMaintenance is part of the StateStoreProvider Contract for optional state maintenance.

doMaintenance simply does state snapshotting followed by cleaning up (removing old state files).

In case of any non-fatal errors, doMaintenance simply prints out the following WARN
message to the logs:

Error performing snapshot and cleaning up [this]

State Snapshotting (Rolling Up Delta Files into Snapshot File) —  doSnapshot Internal Method

doSnapshot(): Unit

doSnapshot lists all delta and snapshot files in the state checkpoint directory ( files ) and

prints out the following DEBUG message to the logs:

fetchFiles() took [time] ms.


doSnapshot returns immediately (and does nothing) when there are no delta and snapshot

files.

doSnapshot takes the version of the latest file ( lastVersion ).

doSnapshot finds the snapshot file and delta files for the version (among the files and for the

last version).

doSnapshot looks up the last version in the internal state cache.

When the last version was found in the cache and the number of delta files is above
spark.sql.streaming.stateStore.minDeltasForSnapshot internal threshold, doSnapshot writes
a compressed snapshot file for the last version.

In the end, doSnapshot prints out the following DEBUG message to the logs:

writeSnapshotFile() took [time] ms.

In case of non-fatal errors, doSnapshot simply prints out the following WARN message to
the logs:

Error doing snapshots for [this]

Note doSnapshot is used exclusively when HDFSBackedStateStoreProvider is requested to do state maintenance (state snapshotting and cleaning up).

Cleaning Up (Removing Old State Files) —  cleanup Internal


Method

cleanup(): Unit

cleanup lists all delta and snapshot files in the state checkpoint directory ( files ) and

prints out the following DEBUG message to the logs:

fetchFiles() took [time] ms.

cleanup returns immediately (and does nothing) when there are no delta and snapshot

files.

cleanup takes the version of the latest state file ( lastVersion ) and decrements it by

spark.sql.streaming.minBatchesToRetain configuration property (default: 100 ) that gives


the earliest version to retain (and all older state files to be removed).
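
A tiny worked example of that rule, assuming the default spark.sql.streaming.minBatchesToRetain of 100:

// with the latest state file at version 523, versions below 423 become eligible for deletion
val lastVersion = 523L
val minBatchesToRetain = 100
val earliestVersionToRetain = lastVersion - minBatchesToRetain  // 423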


cleanup requests the CheckpointFileManager to delete the path of every old state file.

cleanup prints out the following DEBUG message to the logs:

deleting files took [time] ms.

In the end, cleanup prints out the following INFO message to the logs:

Deleted files older than [version] for [this]: [filesToDelete]

In case of a non-fatal exception, cleanup prints out the following WARN message to the
logs:

Error cleaning up files for [this]

Note cleanup is used exclusively when HDFSBackedStateStoreProvider is requested for state maintenance (state snapshotting and cleaning up).

Closing State Store Provider —  close Method

close(): Unit

Note close is part of the StateStoreProvider Contract to close the state store provider.

close …​FIXME

getMetricsForProvider Method

getMetricsForProvider(): Map[String, Long]

getMetricsForProvider returns the following performance metrics:

memoryUsedBytes

metricLoadedMapCacheHit

metricLoadedMapCacheMiss

Note getMetricsForProvider is used exclusively when HDFSBackedStateStore is requested for performance metrics.


Supported StateStoreCustomMetrics 
—  supportedCustomMetrics Method

supportedCustomMetrics: Seq[StateStoreCustomMetric]

Note supportedCustomMetrics is part of the StateStoreProvider Contract for the StateStoreCustomMetrics of a state store provider.

supportedCustomMetrics includes the following StateStoreCustomMetrics:

metricStateOnCurrentVersionSizeBytes

metricLoadedMapCacheHit

metricLoadedMapCacheMiss

Committing State Changes (As New Version of State) 


—  commitUpdates Internal Method

commitUpdates(
newVersion: Long,
map: ConcurrentHashMap[UnsafeRow, UnsafeRow],
output: DataOutputStream): Unit

commitUpdates finalizes the delta file (finalizeDeltaFile with the given DataOutputStream) followed by caching the new version of state (with the given newVersion and the map state).

Note commitUpdates is used exclusively when HDFSBackedStateStore is requested to commit state changes.

Loading Specified Version of State (from Internal Cache or


Snapshot and Delta Files) —  loadMap Internal Method

loadMap(
version: Long): ConcurrentHashMap[UnsafeRow, UnsafeRow]

loadMap firstly tries to find the state version in the loadedMaps internal cache and, if found,

returns it immediately and increments the loadedMapCacheHitCount metric.

If the requested state version could not be found in the loadedMaps internal cache, loadMap
prints out the following WARN message to the logs:


The state for version [version] doesn't exist in loadedMaps.


Reading snapshot file and delta files if needed...Note that this
is normal for the first batch of starting query.

loadMap increments the loadedMapCacheMissCount metric.

loadMap tries to load the state snapshot file for the version and, if found, puts the version of

state in the internal cache and returns it.

If not found, loadMap tries to find the most recent state version by decrementing the
requested version until one is found in the loadedMaps internal cache or loaded from a state
snapshot (file).

loadMap then runs updateFromDeltaFile for all the remaining versions (from the snapshot version up to the requested one). loadMap puts the final version of state (the closest snapshot with the remaining delta versions applied) in the internal cache and returns it.

In the end, loadMap prints out the following DEBUG message to the logs:

Loading state for [version] takes [elapsedMs] ms.

Note loadMap is used exclusively when HDFSBackedStateStoreProvider is requested for the specified version of a state store for update.
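
A minimal sketch of the lookup-and-replay logic, using a plain ConcurrentHashMap[String, String] instead of UnsafeRow pairs; loadFromCache, readSnapshot and applyDelta are hypothetical stand-ins for the cache lookup, readSnapshotFile and updateFromDeltaFile steps (and the copying of cached maps that Spark does before mutating them is omitted):

import java.util.concurrent.ConcurrentHashMap

def loadMap(
    version: Long,
    loadFromCache: Long => Option[ConcurrentHashMap[String, String]],
    readSnapshot: Long => Option[ConcurrentHashMap[String, String]],
    applyDelta: (Long, ConcurrentHashMap[String, String]) => Unit
  ): ConcurrentHashMap[String, String] = {
  // 1. the requested version straight from the cache or its snapshot file
  loadFromCache(version).orElse(readSnapshot(version)).getOrElse {
    // 2. walk back to the closest older version available in the cache or as a snapshot
    var base = version - 1
    var found: Option[ConcurrentHashMap[String, String]] = None
    while (base > 0 && found.isEmpty) {
      found = loadFromCache(base).orElse(readSnapshot(base))
      if (found.isEmpty) base -= 1
    }
    val state = found.getOrElse(new ConcurrentHashMap[String, String]())
    // 3. replay the remaining delta files up to the requested version
    ((base + 1) to version).foreach(v => applyDelta(v, state))
    state
  }
}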

Loading State Snapshot File For Specified Version 


—  readSnapshotFile Internal Method

readSnapshotFile(
version: Long): Option[ConcurrentHashMap[UnsafeRow, UnsafeRow]]

readSnapshotFile creates the path of the snapshot file for the given version .

readSnapshotFile requests the CheckpointFileManager to open the snapshot file for reading

and creates a decompressed DataInputStream ( input ).

readSnapshotFile reads the decompressed input stream until an EOF (that is marked as the

integer -1 in the stream) and inserts key and value rows in a state map
( ConcurrentHashMap[UnsafeRow, UnsafeRow] ):

First integer is the size of a key (buffer) followed by the key itself (of the size).
readSnapshotFile creates an UnsafeRow for the key (with the number of fields as

indicated by the number of fields of the key schema).


Next integer is the size of a value (buffer) followed by the value itself (of the size).
readSnapshotFile creates an UnsafeRow for the value (with the number of fields as

indicated by the number of fields of the value schema).

In the end, readSnapshotFile prints out the following INFO message to the logs and returns
the key-value map.

Read snapshot file for version [version] of [this] from [fileToRead]

In case of FileNotFoundException readSnapshotFile simply returns None (to indicate no


snapshot state file was available and so no state for the version).

readSnapshotFile throws an IOException for the size of a key or a value below 0 :

Error reading snapshot file [fileToRead] of [this]: [key|value] size cannot be [keySiz
e|valueSize]

Note readSnapshotFile is used exclusively when HDFSBackedStateStoreProvider is requested to load the specified version of state (from the internal cache or snapshot and delta files).
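
A minimal sketch of the record format described above, using plain byte arrays instead of UnsafeRows and an already-decompressed stream: each record is a key size, the key bytes, a value size and the value bytes, and a key size of -1 marks the end of the file (for delta files, a value size of -1 instead marks a removed key).

import java.io.DataInputStream
import scala.collection.mutable

def readRecords(input: DataInputStream): mutable.Map[Seq[Byte], Seq[Byte]] = {
  val map = mutable.Map.empty[Seq[Byte], Seq[Byte]]
  var keySize = input.readInt()
  while (keySize != -1) {                 // -1 at the key position means EOF
    require(keySize >= 0, s"key size cannot be $keySize")
    val key = new Array[Byte](keySize)
    input.readFully(key)
    val valueSize = input.readInt()
    require(valueSize >= 0, s"value size cannot be $valueSize")
    val value = new Array[Byte](valueSize)
    input.readFully(value)
    map(key.toSeq) = value.toSeq
    keySize = input.readInt()
  }
  map
}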

Updating State with State Changes For Specified Version


(per Delta File) —  updateFromDeltaFile Internal
Method

updateFromDeltaFile(
version: Long,
map: ConcurrentHashMap[UnsafeRow, UnsafeRow]): Unit

Note updateFromDeltaFile is very similar code-wise to readSnapshotFile with two main differences:

updateFromDeltaFile is given the state map to update (while readSnapshotFile loads the state from a snapshot file)

updateFromDeltaFile removes a key from the state map when the value (size) is -1 (while readSnapshotFile throws an IOException)

The following description is almost an exact copy of readSnapshotFile, just for completeness.

updateFromDeltaFile creates the path of the delta file for the requested version .


updateFromDeltaFile requests the CheckpointFileManager to open the delta file for reading

and creates a decompressed DataInputStream ( input ).

updateFromDeltaFile reads the decompressed input stream until an EOF (that is marked as

the integer -1 in the stream) and inserts key and value rows in the given state map:

First integer is the size of a key (buffer) followed by the key itself (of the size).
updateFromDeltaFile creates an UnsafeRow for the key (with the number of fields as

indicated by the number of fields of the key schema).

Next integer is the size of a value (buffer) followed by the value itself (of the size).
updateFromDeltaFile creates an UnsafeRow for the value (with the number of fields as

indicated by the number of fields of the value schema) or removes the corresponding
key from the state map (if the value size is -1 )

Note updateFromDeltaFile removes the key-value entry from the state map if the value (size) is -1.

In the end, updateFromDeltaFile prints out the following INFO message to the logs.

Read delta file for version [version] of [this] from [fileToRead]

updateFromDeltaFile throws an IllegalStateException in case of FileNotFoundException

while opening the delta file for the specified version:

Error reading delta file [fileToRead] of [this]: [fileToRead] does not exist

Note updateFromDeltaFile is used exclusively when HDFSBackedStateStoreProvider is requested to load the specified version of state (from the internal cache or snapshot and delta files).

Caching New Version of State 


—  putStateIntoStateCacheMap Internal Method

putStateIntoStateCacheMap(
newVersion: Long,
map: ConcurrentHashMap[UnsafeRow, UnsafeRow]): Unit

putStateIntoStateCacheMap registers state for a given version, i.e. adds the map state under

the newVersion key in the loadedMaps internal registry.


With the numberOfVersionsToRetainInMemory threshold as 0 or below,


putStateIntoStateCacheMap simply removes all entries from the loadedMaps internal registry

and returns.

putStateIntoStateCacheMap removes the oldest state version(s) in the loadedMaps internal

registry until its size is at the numberOfVersionsToRetainInMemory threshold.

With the size of the loadedMaps internal registry at the numberOfVersionsToRetainInMemory threshold, putStateIntoStateCacheMap does two more optimizations per newVersion:

It does not add the given state when the version of the oldest state is earlier (larger)
than the given newVersion

It removes the oldest state when older (smaller) than the given newVersion

Note putStateIntoStateCacheMap is used when HDFSBackedStateStoreProvider is requested to commit state (as a new version) and load the specified version of state (from the internal cache or snapshot and delta files).
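
A minimal sketch of those trimming rules, with String placeholders standing in for the per-version state maps and a plain mutable Map instead of the version-ordered TreeMap:

import scala.collection.mutable

def putStateIntoCache(
    cache: mutable.Map[Long, String],
    maxVersionsToRetain: Int,
    newVersion: Long,
    state: String): Unit = {
  if (maxVersionsToRetain <= 0) { cache.clear(); return }
  // evict the oldest versions until the cache is at the threshold
  while (cache.size > maxVersionsToRetain) cache.remove(cache.keys.min)
  if (cache.size == maxVersionsToRetain) {
    val oldestVersion = cache.keys.min
    if (oldestVersion > newVersion) return    // newVersion is older than everything cached
    cache.remove(oldestVersion)               // make room for the newer version
  }
  cache.put(newVersion, state)
}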

Writing Compressed Snapshot File for Specified Version 


—  writeSnapshotFile Internal Method

writeSnapshotFile(
version: Long,
map: ConcurrentHashMap[UnsafeRow, UnsafeRow]): Unit

writeSnapshotFile creates the snapshot file (snapshotFile) for the given version.

writeSnapshotFile requests the CheckpointFileManager to create the snapshot file (with

overwriting enabled) and compress the stream.

For every key-value UnsafeRow pair in the given map, writeSnapshotFile writes the size of
the key followed by the key itself (as bytes). writeSnapshotFile then writes the size of the
value followed by the value itself (as bytes).

In the end, writeSnapshotFile prints out the following INFO message to the logs:

Written snapshot file for version [version] of [this] at [targetFile]

In case of any Throwable exception, writeSnapshotFile cancels the delta file (cancelDeltaFile) and re-throws the exception.


Note writeSnapshotFile is used exclusively when HDFSBackedStateStoreProvider is requested to doSnapshot.

compressStream Internal Method

compressStream(
outputStream: DataOutputStream): DataOutputStream

compressStream creates a new LZ4CompressionCodec (based on the SparkConf) and

requests it to create a LZ4BlockOutputStream with the given DataOutputStream .

In the end, compressStream creates a new DataOutputStream with the


LZ4BlockOutputStream .

Note compressStream is used when…​FIXME

cancelDeltaFile Internal Method

cancelDeltaFile(
compressedStream: DataOutputStream,
rawStream: CancellableFSDataOutputStream): Unit

cancelDeltaFile …​FIXME

Note cancelDeltaFile is used when…​FIXME

finalizeDeltaFile Internal Method

finalizeDeltaFile(
output: DataOutputStream): Unit

finalizeDeltaFile simply writes -1 to the given DataOutputStream (to indicate end of file)

and closes it.

Note finalizeDeltaFile is used exclusively when HDFSBackedStateStoreProvider is requested to commit state changes (a new version of state).

Lookup Table (Cache) of States By Version —  loadedMaps Internal Registry


loadedMaps: TreeMap[
Long, // version
ConcurrentHashMap[UnsafeRow, UnsafeRow]] // state (as keys and values)

loadedMaps is a java.util.TreeMap of state versions sorted according to the reversed

ordering of the versions (i.e. long numbers).

A new entry (a version and the state updates) can only be added when
HDFSBackedStateStoreProvider is requested to putStateIntoStateCacheMap (and only when

the spark.sql.streaming.maxBatchesToRetainInMemory internal configuration is above 0 ).

loadedMaps is mainly used when HDFSBackedStateStoreProvider is requested to load the

specified version of state (from the internal cache or snapshot and delta files). Positive hits
(when a version could be found in the cache) is available as the count of cache hit on states
cache in provider performance metric while misses are counted in the count of cache miss
on states cache in provider performance metric.

Note With no (or missing) versions in the cache, the count of cache miss on states cache in provider metric should be above 0, while the count of cache hit on states cache in provider metric is 0 (or smaller than the other metric).

The estimated size of loadedMaps is available as the memoryUsedBytes performance


metric.

The spark.sql.streaming.maxBatchesToRetainInMemory internal configuration is used as the


threshold of the number of elements in loadedMaps . When 0 or negative, every
putStateIntoStateCacheMap removes all elements in (clears) loadedMaps .

Note It is possible to change the configuration at restart of a structured query.

The state deltas (the values) in loadedMaps are cleared (all entries removed) when
HDFSBackedStateStoreProvider is requested to close.

Used when HDFSBackedStateStoreProvider is requested for the following:

Cache a version of state

Loading the specified version of state (from the internal cache or snapshot and delta
files)

Internal Properties


fm

CheckpointFileManager for the state checkpoint base directory (and the Hadoop Configuration)

Used when:
Creating a new HDFSBackedStateStore (to create the CancellableFSDataOutputStream for the finalDeltaFile)
HDFSBackedStateStoreProvider is requested to initialize (to create the state checkpoint base directory), updateFromDeltaFile, write the compressed snapshot file for a specified state version, readSnapshotFile, clean up, and list all delta and snapshot files in the state checkpoint directory

hadoopConf

Hadoop Configuration of the CheckpointFileManager

Given when HDFSBackedStateStoreProvider is requested to initialize

keySchema

keySchema: StructType

Schema of the state keys

valueSchema

valueSchema: StructType

Schema of the state values

numberOfVersionsToRetainInMemory

numberOfVersionsToRetainInMemory: Int

numberOfVersionsToRetainInMemory is the maximum number of entries in the loadedMaps internal registry and is configured by the spark.sql.streaming.maxBatchesToRetainInMemory internal configuration.

numberOfVersionsToRetainInMemory is a threshold when HDFSBackedStateStoreProvider removes the last key from the loadedMaps internal registry (per reverse ordering of state versions) when requested to putStateIntoStateCacheMap.

sparkConf

SparkConf


StateStoreCoordinator RPC Endpoint — 


Tracking Locations of StateStores for
StateStoreRDD
StateStoreCoordinator keeps track of state stores on Spark executors (per host and

executor ID).

StateStoreCoordinator is used by StateStoreRDD when requested to get the location

preferences of partitions (based on the location of the stores).

StateStoreCoordinator is a ThreadSafeRpcEndpoint RPC endpoint that manipulates

instances registry through RPC messages.

Table 1. StateStoreCoordinator RPC Endpoint’s Messages and Message Handlers

DeactivateInstances

Removes the StateStoreProviderIds of a streaming query (given runId)

Internally, StateStoreCoordinator finds the StateStoreProviderIds of the streaming query (per queryRunId and the given runId) and removes them from the instances internal registry.

StateStoreCoordinator prints out the following DEBUG message to the logs:

Deactivating instances related to checkpoint location [runId]: [storeIdsToRemove]

GetLocation

Gives the location of a StateStoreProviderId (from instances), i.e. the host and the executor id on that host

You should see the following DEBUG message in the logs:

Got location of the state store [id]: [executorId]

ReportActiveInstance

One-way asynchronous (fire-and-forget) message to register a new StateStoreProviderId on an executor (given host and executorId)

Sent out exclusively when StateStoreCoordinatorRef RPC endpoint reference is requested to reportActiveInstance (when the StateStore utility is requested to look up the StateStore by provider ID when the StateStore and a corresponding StateStoreProvider were just created and initialized)

Internally, StateStoreCoordinator prints out the following DEBUG message to the logs:

Reported state store [id] is active at [executorId]

In the end, StateStoreCoordinator adds the StateStoreProviderId to the instances internal registry.

StopCoordinator

Stops the StateStoreCoordinator RPC Endpoint

You should see the following DEBUG message in the logs:

StateStoreCoordinator stopped

VerifyIfInstanceActive

Verifies if a given StateStoreProviderId is registered (in instances) for a given executor

You should see the following DEBUG message in the logs:

Verified that state store [id] is active: [response]

Tip Enable ALL logging level for org.apache.spark.sql.execution.streaming.state.StateStoreCoordinator logger to see what happens inside.

Add the following line to conf/log4j.properties:

log4j.logger.org.apache.spark.sql.execution.streaming.state.StateStoreCoordinator=ALL

Refer to Logging.

instances Internal Registry

instances: HashMap[StateStoreProviderId, ExecutorCacheTaskLocation]

instances is an internal registry of StateStoreProviders by their StateStoreProviderIds and ExecutorCacheTaskLocations (with a host and an executorId).

A new StateStoreProviderId is added when StateStoreCoordinator is requested to handle a ReportActiveInstance message.

All StateStoreProviderIds of a streaming query are removed when StateStoreCoordinator is requested to handle a DeactivateInstances message.


StateStoreCoordinatorRef — RPC Endpoint
Reference to StateStoreCoordinator
StateStoreCoordinatorRef is used to let the tasks on Spark executors send messages to the StateStoreCoordinator (that lives on the driver).

StateStoreCoordinatorRef is given the RpcEndpointRef to the StateStoreCoordinator RPC

endpoint when created.

StateStoreCoordinatorRef is created through StateStoreCoordinatorRef helper object when

requested to create one for the driver (when StreamingQueryManager is created) or an


executor (when StateStore helper object is requested for the RPC endpoint reference to
StateStoreCoordinator for Executors).

Table 1. StateStoreCoordinatorRef’s Methods and Underlying RPC Messages

deactivateInstances
  deactivateInstances(runId: UUID): Unit
  Requests the RpcEndpointRef to send a DeactivateInstances synchronous message with the given runId and waits for a true / false response
  Used exclusively when StreamingQueryManager is requested to handle termination of a streaming query (when StreamExecution is requested to run a streaming query and the query has finished (running streaming batches))

getLocation
  getLocation(
    stateStoreProviderId: StateStoreProviderId): Option[String]
  Requests the RpcEndpointRef to send a GetLocation synchronous message with the given StateStoreProviderId and waits for the location
  Used when:
  StateStoreAwareZipPartitionsRDD is requested for the preferred locations of a partition (when StreamingSymmetricHashJoinExec physical operator is requested to execute and generate a recipe for a distributed computation (as an RDD[InternalRow]))
  StateStoreRDD is requested for preferred locations for a task for a partition

reportActiveInstance
  reportActiveInstance(
    stateStoreProviderId: StateStoreProviderId,
    host: String,
    executorId: String): Unit
  Requests the RpcEndpointRef to send a ReportActiveInstance one-way asynchronous (fire-and-forget) message with the given StateStoreProviderId, host and executorId
  Used exclusively when the StateStore utility is requested for reportActiveStoreInstance (when the StateStore utility is requested to look up the StateStore by StateStoreProviderId)

stop
  stop(): Unit
  Requests the RpcEndpointRef to send a StopCoordinator synchronous message
  Used exclusively for unit testing

verifyIfInstanceActive
  verifyIfInstanceActive(
    stateStoreProviderId: StateStoreProviderId,
    executorId: String): Boolean
  Requests the RpcEndpointRef to send a VerifyIfInstanceActive synchronous message with the given StateStoreProviderId and executorId, and waits for a true / false response
  Used exclusively when the StateStore helper object is requested for verifyIfStoreInstanceActive (when requested to doMaintenance from a running MaintenanceTask daemon thread)

Creating StateStoreCoordinatorRef to
StateStoreCoordinator RPC Endpoint for Driver 
—  forDriver Factory Method

forDriver(env: SparkEnv): StateStoreCoordinatorRef

forDriver …​FIXME

Note forDriver is used exclusively when StreamingQueryManager is created.


Creating StateStoreCoordinatorRef to
StateStoreCoordinator RPC Endpoint for Executor 
—  forExecutor Factory Method

forExecutor(env: SparkEnv): StateStoreCoordinatorRef

forExecutor …​FIXME

forExecutor is used exclusively when StateStore helper object is requested


Note
for the RPC endpoint reference to StateStoreCoordinator for Executors.

WatermarkSupport Contract — Unary Physical Operators with Streaming Watermark Support
WatermarkSupport is the abstraction of unary physical operators ( UnaryExecNode ) with

support for streaming event-time watermark.

Watermark (aka "allowed lateness") is a moving threshold of event time and


specifies what data to consider for aggregations, i.e. the threshold of late data
so the engine can automatically drop incoming late data given event time and
Note clean up old state accordingly.
Read the official documentation of Spark in Handling Late Data and
Watermarking.
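The following is a minimal sketch of how a streaming query declares the event-time watermark that WatermarkSupport operators later rely on. The 10-minute delay and 5-minute window are illustrative only; the rate source is used simply because it provides a timestamp column.

import org.apache.spark.sql.functions.{col, window}

// Minimal sketch: declare a streaming watermark with Dataset.withWatermark.
val events = spark.readStream.format("rate").load  // has `timestamp` and `value` columns
val counts = events
  .withWatermark("timestamp", "10 minutes")        // allowed lateness
  .groupBy(window(col("timestamp"), "5 minutes"))
  .count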

Table 1. WatermarkSupport’s (Lazily-Initialized) Properties

watermarkExpression
  Optional Catalyst expression that matches rows older than the event-time watermark.
  Note: Use the withWatermark operator to specify the streaming watermark.
  When initialized, watermarkExpression finds the spark.watermarkDelayMs watermark attribute in the child output’s metadata.
  If found, watermarkExpression creates an evictionExpression with the watermark attribute that is less than or equal to eventTimeWatermark.
  The watermark attribute may be of type StructType. If it is, watermarkExpression uses the first field as the watermark.
  watermarkExpression prints out the following INFO message to the logs when the spark.watermarkDelayMs watermark attribute is found:
  INFO [physicalOperator]Exec: Filtering state store on: [evictionExpression]
  Note: physicalOperator can be FlatMapGroupsWithStateExec, StateStoreSaveExec or StreamingDeduplicateExec.
  Tip: Enable INFO logging level for one of the stateful physical operators to see the INFO message in the logs.

watermarkPredicateForData
  Optional Predicate that uses the watermarkExpression and the child output to match rows older than the event-time watermark

watermarkPredicateForKeys
  Optional Predicate that uses the keyExpressions to match rows older than the event-time watermark

WatermarkSupport Contract

package org.apache.spark.sql.execution.streaming

trait WatermarkSupport extends UnaryExecNode {


// only required methods that have no implementation
def eventTimeWatermark: Option[Long]
def keyExpressions: Seq[Attribute]
}

Table 2. WatermarkSupport Contract

eventTimeWatermark
  Used mainly in watermarkExpression to create a LessThanOrEqual Catalyst binary expression that matches rows older than the watermark.

keyExpressions
  Grouping keys (in FlatMapGroupsWithStateExec), duplicate keys (in StreamingDeduplicateExec) or key attributes (in StateStoreSaveExec) with at most one that may have the spark.watermarkDelayMs watermark attribute in metadata
  Used in watermarkPredicateForKeys to create a Predicate to match rows older than the event-time watermark.
  Used also when StateStoreSaveExec and StreamingDeduplicateExec physical operators are executed.

Removing Keys From StateStore Older Than Watermark 


—  removeKeysOlderThanWatermark Method

removeKeysOlderThanWatermark(store: StateStore): Unit

removeKeysOlderThanWatermark requests the input store for all rows.

removeKeysOlderThanWatermark then uses watermarkPredicateForKeys to remove matching

rows from the store.

removeKeysOlderThanWatermark is used exclusively when


Note StreamingDeduplicateExec physical operator is requested to execute and
generate a recipe for a distributed computation (as an RDD[InternalRow]).

removeKeysOlderThanWatermark Method

removeKeysOlderThanWatermark(
storeManager: StreamingAggregationStateManager,
store: StateStore): Unit

removeKeysOlderThanWatermark …​FIXME


removeKeysOlderThanWatermark is used exclusively when StateStoreSaveExec


Note physical operator is requested to execute and generate a recipe for a
distributed computation (as an RDD[InternalRow]).


StatefulOperator Contract — Physical
Operators That Read or Write to StateStore
StatefulOperator is the base of physical operators that read or write state (described by

stateInfo).

Table 1. StatefulOperator Contract


Method Description

stateInfo: Option[StatefulOperatorStateInfo]
stateInfo

The StatefulOperatorStateInfo of the physical operator

Table 2. StatefulOperators (Direct Implementations)

StateStoreReader

StateStoreWriter
  Physical operator that writes to a state store and collects the write metrics for execution progress reporting


StateStoreReader
StateStoreReader is…​FIXME

StateStoreWriter Contract — Stateful Physical Operators That Write to State Store
StateStoreWriter is the extension of the StatefulOperator Contract for physical operators

that write to a state store and collect the write metrics for execution progress reporting.

Table 1. StateStoreWriters
StateStoreWriter Description
FlatMapGroupsWithStateExec

StateStoreSaveExec

StreamingDeduplicateExec

StreamingGlobalLimitExec

StreamingSymmetricHashJoinExec

Performance Metrics (SQLMetrics)

Name (in web UI):
  number of output rows
  number of total state rows
  number of updated state rows
  total time to update rows
  total time to remove rows
  time to commit changes
  memory used by state


Setting StateStore-Specific Metrics for Stateful Physical


Operator —  setStoreMetrics Method

setStoreMetrics(store: StateStore): Unit

setStoreMetrics requests the specified StateStore for the metrics and records the following

metrics of a physical operator:

numTotalStateRows as the number of keys

stateMemory as the memory used (in bytes)

setStoreMetrics records the custom metrics.

setStoreMetrics is used when the following physical operators are executed:

FlatMapGroupsWithStateExec

Note StateStoreSaveExec

StreamingDeduplicateExec
StreamingGlobalLimitExec

getProgress Method

getProgress(): StateOperatorProgress

getProgress …​FIXME

getProgress is used exclusively when ProgressReporter is requested to


Note extractStateOperatorMetrics (when MicroBatchExecution is requested to run
the activated streaming query).

Checking Out Whether Last Batch Execution Requires


Another Non-Data Batch or Not 
—  shouldRunAnotherBatch Method

shouldRunAnotherBatch(newMetadata: OffsetSeqMetadata): Boolean

shouldRunAnotherBatch is negative ( false ) by default (to indicate that another non-data

batch is not required given the OffsetSeqMetadata with the event-time watermark and the
batch timestamp).


shouldRunAnotherBatch is used exclusively when IncrementalExecution is


Note requested to check out whether the last batch execution requires another batch
(when MicroBatchExecution is requested to run the activated streaming query).

stateStoreCustomMetrics Internal Method

stateStoreCustomMetrics: Map[String, SQLMetric]

stateStoreCustomMetrics …​FIXME

stateStoreCustomMetrics is used when StateStoreWriter is requested for the


Note
metrics and getProgress.

timeTakenMs Method

timeTakenMs(body: => Unit): Long

timeTakenMs …​FIXME

Note timeTakenMs is used when…​FIXME


StatefulOperatorStateInfo
StatefulOperatorStateInfo identifies the state store for a given stateful physical operator:

Checkpoint directory ( checkpointLocation )

Run ID of a streaming query ( queryRunId )

Stateful operator ID ( operatorId )

State version ( storeVersion )

Number of partitions

StatefulOperatorStateInfo is created exclusively when IncrementalExecution is requested

for nextStatefulOperationStateInfo.

When requested for a textual representation ( toString ), StatefulOperatorStateInfo


returns the following:

state info [ checkpoint = [checkpointLocation], runId = [queryRunId], opId = [operatorId], ver = [storeVersion], numPartitions = [numPartitions]]

State Version and Batch ID


When created (when IncrementalExecution is requested for the next
StatefulOperatorStateInfo), a StatefulOperatorStateInfo is given a state version.

The state version is exactly the batch ID of the IncrementalExecution.


StateStoreMetrics
StateStoreMetrics holds the performance metrics of a state store:

Number of keys

Memory used (in bytes)

StateStoreCustomMetrics with their current values ( Map[StateStoreCustomMetric,


Long] )

StateStoreMetrics is used (and created) when the following are requested for the

performance metrics:

StateStore

StateStoreHandler

SymmetricHashJoinStateManager


StateStoreCustomMetric Contract
StateStoreCustomMetric is the abstraction of metrics that a state store may wish to expose

(as StateStoreMetrics or supportedCustomMetrics).

StateStoreCustomMetric is used when:

StateStoreProvider is requested for the custom metrics

StateStoreMetrics is created

Table 1. StateStoreCustomMetric Contract


Method Description

desc: String
desc

Description of the custom metrics

name: String
name

Name of the custom metrics

Table 2. StateStoreCustomMetrics
StateStoreCustomMetric Description
StateStoreCustomSizeMetric

StateStoreCustomSumMetric

StateStoreCustomTimingMetric


StateStoreUpdater
StateStoreUpdater is…​FIXME

updateStateForKeysWithData Method

Caution FIXME

updateStateForTimedOutKeys Method

Caution FIXME

EventTimeStatsAccum Accumulator — Event-Time Column Statistics for EventTimeWatermarkExec Physical Operator
EventTimeStatsAccum is a Spark accumulator that is used for the statistics of the event-time

column (that EventTimeWatermarkExec physical operator uses for event-time watermark):

Maximum value

Minimum value

Average value

Number of updates (count)

EventTimeStatsAccum is created and registered exclusively for EventTimeWatermarkExec

physical operator.

Note: When EventTimeWatermarkExec physical operator is requested to execute and generate a recipe for a distributed computation (as a RDD[InternalRow]), every task simply adds the values of the event-time watermark column to the EventTimeStatsAccum accumulator.
As per the design of Spark accumulators in Apache Spark, accumulator updates are automatically sent out (propagated) from tasks to the driver every heartbeat and then they are accumulated together.

Tip Read up on Accumulators in The Internals of Apache Spark book.

EventTimeStatsAccum takes a single EventTimeStats to be created (default: zero).

Accumulating Value —  add Method

add(v: Long): Unit

Note add is part of the AccumulatorV2 Contract to add (accumulate) a given value.

add simply requests the EventTimeStats to add the given v value.

add is used exclusively when EventTimeWatermarkExec physical operator is


Note requested to execute and generate a recipe for a distributed computation (as a
RDD[InternalRow]).


EventTimeStats
EventTimeStats is a Scala case class for the event-time column statistics.

EventTimeStats defines a special value zero with the following values:

Long.MinValue for the max

Long.MaxValue for the min

0.0 for the avg

0L for the count

EventTimeStats.add Method

add(eventTime: Long): Unit

add simply updates the event-time column statistics per given eventTime .

add is used exclusively when EventTimeStatsAccum is requested to


Note
accumulate the value of an event-time column.
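As a rough illustration only (not the actual EventTimeStats implementation), folding a single event time into the four statistics could look as follows; the Stats case class and its zero value are assumptions for this sketch.

// Rough illustration of updating max, min, average and count per event time.
case class Stats(max: Long, min: Long, avg: Double, count: Long) {
  def add(eventTime: Long): Stats = Stats(
    math.max(max, eventTime),
    math.min(min, eventTime),
    avg + (eventTime - avg) / (count + 1),  // incremental mean
    count + 1)
}

val zero = Stats(Long.MinValue, Long.MaxValue, 0.0, 0L)
val stats = Seq(1000L, 2000L, 3000L).foldLeft(zero)(_ add _)
// stats: Stats(3000, 1000, 2000.0, 3)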

EventTimeStats.merge Method

merge(that: EventTimeStats): Unit

merge …​FIXME

Note merge is used when…​FIXME


StateStoreConf
StateStoreConf is…​FIXME

Table 1. StateStoreConf’s Properties

minDeltasForSnapshot
  spark.sql.streaming.stateStore.minDeltasForSnapshot

maxVersionsToRetainInMemory
  spark.sql.streaming.maxBatchesToRetainInMemory

minVersionsToRetain
  spark.sql.streaming.minBatchesToRetain
  Used exclusively when HDFSBackedStateStoreProvider is requested for cleanup.

providerClass
  spark.sql.streaming.stateStore.providerClass
  Used exclusively when the StateStoreProvider helper object is requested to create and initialize the StateStoreProvider.
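A minimal sketch of setting the configuration properties above when building a SparkSession; the application name and all values are illustrative only.

import org.apache.spark.sql.SparkSession

// Minimal sketch: tuning the configuration properties that StateStoreConf reads.
val spark = SparkSession.builder
  .appName("state-store-conf-demo")
  .config("spark.sql.streaming.minBatchesToRetain", "100")
  .config("spark.sql.streaming.maxBatchesToRetainInMemory", "2")
  .config("spark.sql.streaming.stateStore.minDeltasForSnapshot", "10")
  .getOrCreate()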


Arbitrary Stateful Streaming Aggregation


Arbitrary Stateful Streaming Aggregation is a streaming aggregation query that uses the
following KeyValueGroupedDataset operators:

mapGroupsWithState for implicit state logic

flatMapGroupsWithState for explicit state logic

KeyValueGroupedDataset represents a grouped dataset as a result of Dataset.groupByKey

operator.

mapGroupsWithState and flatMapGroupsWithState operators use GroupState as group

streaming aggregation state that is created separately for every aggregation key with an
aggregation state value (of a user-defined type).

mapGroupsWithState and flatMapGroupsWithState operators use GroupStateTimeout as an

aggregation state timeout that defines when a GroupState can be considered timed-out
(expired).
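A minimal sketch of the building blocks above; the Event and RunningTotal types and the events streaming Dataset are hypothetical and only illustrate the shape of the API.

import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}

case class Event(userId: String, points: Int)
case class RunningTotal(points: Int)

// The state function receives the key, the new values for that key, and the GroupState.
def updateTotal(
    userId: String,
    newEvents: Iterator[Event],
    state: GroupState[RunningTotal]): RunningTotal = {
  val current = state.getOption.getOrElse(RunningTotal(0))
  val updated = RunningTotal(current.points + newEvents.map(_.points).sum)
  state.update(updated)  // per-key aggregation state value
  updated
}

import spark.implicits._
val totals = events          // events: a streaming Dataset[Event] (hypothetical)
  .groupByKey(_.userId)
  .mapGroupsWithState(GroupStateTimeout.NoTimeout)(updateTotal)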

Demos
Use the following demos and complete applications to learn more:

Demo: Internals of FlatMapGroupsWithStateExec Physical Operator

Demo: Arbitrary Stateful Streaming Aggregation with


KeyValueGroupedDataset.flatMapGroupsWithState Operator

groupByKey Streaming Aggregation in Update Mode

FlatMapGroupsWithStateApp

Performance Metrics
Arbitrary Stateful Streaming Aggregation uses performance metrics (of the
StateStoreWriter through FlatMapGroupsWithStateExec physical operator).

Internals
One of the most important internal execution components of Arbitrary Stateful Streaming
Aggregation is FlatMapGroupsWithStateExec physical operator.


When requested to execute and generate a recipe for a distributed computation (as an
RDD[InternalRow]), FlatMapGroupsWithStateExec first validates a selected
GroupStateTimeout:

For ProcessingTimeTimeout, batch timeout threshold has to be defined

For EventTimeTimeout, event-time watermark has to be defined and the input schema
has the watermark attribute

Note FIXME When are the above requirements met?

FlatMapGroupsWithStateExec physical operator then uses mapPartitionsWithStateStore with a
custom storeUpdateFunction of the following signature:

(StateStore, Iterator[T]) => Iterator[U]

While generating the recipe, FlatMapGroupsWithStateExec uses StateStoreOps extension


method object to register a listener that is executed on a task completion. The listener
makes sure that a given StateStore has all state changes either committed or aborted.

In the end, FlatMapGroupsWithStateExec creates a new StateStoreRDD and adds it to the


RDD lineage.

StateStoreRDD is used to properly distribute tasks across executors (per preferred locations)

with help of StateStoreCoordinator (that runs on the driver).

StateStoreRDD uses StateStore helper to look up a StateStore by StateStoreProviderId

and store version.

FlatMapGroupsWithStateExec physical operator uses state managers that are different than

state managers for Streaming Aggregation. StateStore abstraction is the same as in


Streaming Aggregation.

One of the important execution steps is when InputProcessor (of


FlatMapGroupsWithStateExec physical operator) is requested to
callFunctionAndUpdateState. That executes the user-defined state function on a per-
group state key object, value objects, and a GroupStateImpl.

GroupState — Group State in Arbitrary Stateful Streaming Aggregation
GroupState is an abstraction of group state (of type S ) in Arbitrary Stateful Streaming

Aggregation.

GroupState is used with the following KeyValueGroupedDataset operations:

mapGroupsWithState

flatMapGroupsWithState

GroupState is created separately for every aggregation key to hold a state as an

aggregation state value.

Table 1. GroupState Contract


Method Description

exists: Boolean

exists
Checks whether the state value exists or not
If the state value does not exist, get throws a NoSuchElementException . Use
getOption instead.

get: S

get

Gets the state value if it exists or throws a


NoSuchElementException

getCurrentProcessingTimeMs(): Long

getCurrentProcessingTimeMs
Gets the current processing time (as milliseconds in
epoch time)

getCurrentWatermarkMs(): Long

getCurrentWatermarkMs
Gets the current event time watermark (as milliseconds
in epoch time)

getOption: Option[S]

getOption
Gets the state value as a Scala Option (regardless of whether it exists or not)
Used when:

InputProcessor is requested to callFunctionAndUpdateState (when the row iterator is consumed and a state value has been updated, removed or timeout changed)
GroupStateImpl is requested for the textual representation

hasTimedOut: Boolean

hasTimedOut
Whether the state (for a given key) has timed out or not.

Can only be true when timeouts are enabled using


setTimeoutDuration

remove(): Unit
remove

Removes the state

setTimeoutDuration(durationMs: Long): Unit


setTimeoutDuration(duration: String): Unit

setTimeoutDuration
Specifies the timeout duration for the state key (in
millis or as a string, e.g. "10 seconds", "1 hour") for
GroupStateTimeout.ProcessingTimeTimeout

setTimeoutTimestamp(timestamp: java.sql.Date): Unit
setTimeoutTimestamp(
  timestamp: java.sql.Date,
  additionalDuration: String): Unit
setTimeoutTimestamp(timestampMs: Long): Unit
setTimeoutTimestamp(
  timestampMs: Long,
  additionalDuration: String): Unit

setTimeoutTimestamp
Specifies the timeout timestamp for the state key for GroupStateTimeout.EventTimeTimeout

update(newState: S): Unit


update

Updates the state (sets the state to a new value)


GroupStateImpl is the default and only known implementation of the


Note
GroupState Contract in Spark Structured Streaming.
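A minimal sketch (with a hypothetical SessionState type) of a state function that exercises the contract above: it counts values per key, asks for a processing-time timeout, and cleans up when hasTimedOut is true. It is meant to be passed to flatMapGroupsWithState with GroupStateTimeout.ProcessingTimeTimeout.

import org.apache.spark.sql.streaming.GroupState

case class SessionState(count: Long)

def updateSession(
    key: String,
    values: Iterator[String],
    state: GroupState[SessionState]): Iterator[SessionState] = {
  if (state.hasTimedOut) {
    // only ever true when timeouts are enabled (here: ProcessingTimeTimeout)
    state.remove()
    Iterator.empty
  } else {
    val s = SessionState(state.getOption.map(_.count).getOrElse(0L) + values.size)
    state.update(s)
    state.setTimeoutDuration("30 minutes")  // valid for ProcessingTimeTimeout only
    Iterator.single(s)
  }
}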


GroupStateImpl
GroupStateImpl is the default and only known GroupState in Spark Structured Streaming.

GroupStateImpl holds per-group state value of type S per group key.

GroupStateImpl is created when GroupStateImpl helper object is requested for the

following:

createForStreaming

createForBatch

Creating GroupStateImpl Instance


GroupStateImpl takes the following to be created:

State value (of type S )

Batch processing time

eventTimeWatermarkMs

GroupStateTimeout

hasTimedOut flag

watermarkPresent flag

GroupStateImpl initializes the internal properties.

Creating GroupStateImpl for Streaming Query 


—  createForStreaming Factory Method

createForStreaming[S](
optionalValue: Option[S],
batchProcessingTimeMs: Long,
eventTimeWatermarkMs: Long,
timeoutConf: GroupStateTimeout,
hasTimedOut: Boolean,
watermarkPresent: Boolean): GroupStateImpl[S]

createForStreaming simply creates a new GroupStateImpl with the given input arguments.


createForStreaming is used exclusively when InputProcessor is requested to


Note callFunctionAndUpdateState (when InputProcessor is requested to
processNewData and processTimedOutState).

Creating GroupStateImpl for Batch Query 


—  createForBatch Factory Method

createForBatch(
timeoutConf: GroupStateTimeout,
watermarkPresent: Boolean): GroupStateImpl[Any]

createForBatch …​FIXME

Note createForBatch is used when…​FIXME

Textual Representation —  toString Method

toString: String

toString is part of the java.lang.Object contract for the string representation of


Note
the object.

toString …​FIXME

Specifying Timeout Duration for ProcessingTimeTimeout 


—  setTimeoutDuration Method

setTimeoutDuration(durationMs: Long): Unit

setTimeoutDuration is part of the GroupState Contract to specify timeout


Note
duration for the state key (in millis or as a string).

setTimeoutDuration …​FIXME

Specifying Timeout Timestamp for EventTimeTimeout 


—  setTimeoutTimestamp Method

setTimeoutTimestamp(durationMs: Long): Unit


setTimeoutTimestamp is part of the GroupState Contract to specify timeout


Note
timestamp for the state key.

setTimeoutTimestamp …​FIXME

Getting Processing Time 


—  getCurrentProcessingTimeMs Method

getCurrentProcessingTimeMs(): Long

getCurrentProcessingTimeMs is part of the GroupState Contract to get the


Note
current processing time (as milliseconds in epoch time).

getCurrentProcessingTimeMs simply returns the batchProcessingTimeMs.

Updating State —  update Method

update(newValue: S): Unit

Note update is part of the GroupState Contract to update the state.

update …​FIXME

Removing State —  remove Method

remove(): Unit

Note remove is part of the GroupState Contract to remove the state.

remove …​FIXME

Internal Properties

Name Description

value
  FIXME
  Used when…FIXME

defined
  FIXME
  Used when…FIXME

updated
  Updated flag that says whether the state has been updated or not
  Default: false
  Disabled ( false ) when GroupStateImpl is requested to remove the state
  Enabled ( true ) when GroupStateImpl is requested to update the state

removed
  Removed flag that says whether the state is marked removed or not
  Default: false
  Disabled ( false ) when GroupStateImpl is requested to update the state
  Enabled ( true ) when GroupStateImpl is requested to remove the state

timeoutTimestamp
  Current timeout timestamp (in millis) for GroupStateTimeout.EventTimeTimeout or GroupStateTimeout.ProcessingTimeTimeout
  Default: -1
  Defined using setTimeoutTimestamp (for EventTimeTimeout ) and setTimeoutDuration (for ProcessingTimeTimeout )

GroupStateTimeout — Group State Timeout in Arbitrary Stateful Streaming Aggregation
GroupStateTimeout represents an aggregation state timeout that defines when a

GroupState can be considered timed-out (expired) in Arbitrary Stateful Streaming


Aggregation.

GroupStateTimeout is used with the following KeyValueGroupedDataset operations:

mapGroupsWithState

flatMapGroupsWithState

Table 1. GroupStateTimeouts

EventTimeTimeout
  Timeout based on event time
  Used when…FIXME

NoTimeout
  No timeout
  Used when…FIXME

ProcessingTimeTimeout
  Timeout based on processing time (see the sketch after this table)
  FlatMapGroupsWithStateExec physical operator requires that batchTimestampMs is specified when ProcessingTimeTimeout is used.
  batchTimestampMs is defined when IncrementalExecution is created (with the state). IncrementalExecution is given OffsetSeqMetadata when StreamExecution is requested to run a streaming batch.
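A minimal sketch of selecting a GroupStateTimeout when wiring up flatMapGroupsWithState; the events Dataset, the sessionStateFunc state function, and the Update output mode are illustrative assumptions.

import org.apache.spark.sql.streaming.{GroupStateTimeout, OutputMode}

// Minimal sketch: the GroupStateTimeout is passed to flatMapGroupsWithState.
val sessions = events
  .groupByKey(_.userId)
  .flatMapGroupsWithState(
    OutputMode.Update,
    GroupStateTimeout.ProcessingTimeTimeout)(sessionStateFunc)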

StateManager Contract — State Managers for Arbitrary Stateful Streaming Aggregation
StateManager is the abstraction of state managers that act as middlemen between state

stores and the FlatMapGroupsWithStateExec physical operator used in Arbitrary Stateful


Streaming Aggregation.

Table 1. StateManager Contract

getAllState
  getAllState(store: StateStore): Iterator[StateData]
  Retrieves all state data (for all keys) from the StateStore
  Used exclusively when InputProcessor is requested to processTimedOutState

getState
  getState(
    store: StateStore,
    keyRow: UnsafeRow): StateData
  Gets the state data for the key from the StateStore
  Used exclusively when InputProcessor is requested to processNewData

putState
  putState(
    store: StateStore,
    keyRow: UnsafeRow,
    state: Any,
    timeoutTimestamp: Long): Unit
  Persists (puts) the state value for the key in the StateStore
  Used exclusively when InputProcessor is requested to callFunctionAndUpdateState (right after all rows have been processed)

removeState
  removeState(
    store: StateStore,
    keyRow: UnsafeRow): Unit
  Removes the state for the key from the StateStore
  Used exclusively when InputProcessor is requested to callFunctionAndUpdateState (right after all rows have been processed)

stateSchema
  stateSchema: StructType
  State schema
  Note: It looks like (in the StateManager of the FlatMapGroupsWithStateExec physical operator) stateSchema is used for the schema of state value objects (not state keys as they are described by the grouping attributes instead).
  Used when:
  FlatMapGroupsWithStateExec physical operator is requested to execute and generate a recipe for a distributed computation (as an RDD[InternalRow])
  StateManagerImplBase is requested for the stateDeserializerFunc

StateManagerImplBase is the one and only known direct implementation of the


Note
StateManager Contract in Spark Structured Streaming.

StateManager is a Scala sealed trait which means that all the implementations
Note
are in the same compilation unit (a single file).


StateManagerImplV2 — Default StateManager
of FlatMapGroupsWithStateExec Physical
Operator
StateManagerImplV2 is a concrete StateManager (as a StateManagerImplBase) that is used

by default in FlatMapGroupsWithStateExec physical operator (per


spark.sql.streaming.flatMapGroupsWithState.stateFormatVersion internal configuration
property).

StateManagerImplV2 is created exclusively when FlatMapGroupsWithStateExecHelper utility is

requested for a StateManager (when the stateFormatVersion is 2 ).

Creating StateManagerImplV2 Instance


StateManagerImplV2 takes the following to be created:

State encoder ( ExpressionEncoder[Any] )

shouldStoreTimestamp flag

StateManagerImplV2 initializes the internal properties.

State Schema —  stateSchema Value

stateSchema: StructType

Note stateSchema is part of the StateManager Contract for the schema of the state.

stateSchema …​FIXME

State Serializer —  stateSerializerExprs Value

stateSerializerExprs: Seq[Expression]

stateSerializerExprs is part of the StateManager Contract for the state


Note serializer, i.e. Catalyst expressions to serialize a state object to a row
( UnsafeRow ).

stateSerializerExprs …​FIXME


State Deserializer —  stateDeserializerExpr Value

stateDeserializerExpr: Expression

stateDeserializerExpr is part of the StateManager Contract for the state


Note deserializer, i.e. a Catalyst expression to deserialize a state object from a row
( UnsafeRow ).

stateDeserializerExpr …​FIXME

Internal Properties

Name Description

nestedStateOrdinal
  Position of the state in a state row ( 0 )
  Used when…FIXME

timeoutTimestampOrdinalInRow
  Position of the timeout timestamp in a state row ( 1 )
  Used when…FIXME


StateManagerImplBase
StateManagerImplBase is the extension of the StateManager contract for state managers of

FlatMapGroupsWithStateExec physical operator with the following features:

Use Catalyst expressions for state serialization and deserialization

Use timeoutTimestampOrdinalInRow when the shouldStoreTimestamp flag is on

Table 1. StateManagerImplBase Contract (Abstract Methods Only)


Method Description

stateDeserializerExpr: Expression

stateDeserializerExpr
State deserializer, i.e. a Catalyst expression to
deserialize a state object from a row ( UnsafeRow )
Used exclusively for the stateDeserializerFunc

stateSerializerExprs: Seq[Expression]

stateSerializerExprs
State serializer, i.e. Catalyst expressions to serialize
a state object to a row ( UnsafeRow )
Used exclusively for the stateSerializerFunc

timeoutTimestampOrdinalInRow: Int

timeoutTimestampOrdinalInRow
Position of the timeout timestamp in a state row
Used when StateManagerImplBase is requested to get
and set timeout timestamp

Table 2. StateManagerImplBases
StateManagerImplBase Description
StateManagerImplV1 Legacy StateManager

StateManagerImplV2 Default StateManager

Creating StateManagerImplBase Instance


StateManagerImplBase takes a single shouldStoreTimestamp flag to be created (that is set

when the concrete StateManagerImplBases are created).

StateManagerImplBase is a Scala abstract class and cannot be created directly.


Note
It is created indirectly for the concrete StateManagerImplBases.

StateManagerImplBase initializes the internal properties.

Getting State Data for Key from StateStore —  getState


Method

getState(
store: StateStore,
keyRow: UnsafeRow): StateData

getState is part of the StateManager Contract to get the state data for the key
Note
from the StateStore.

getState …​FIXME

Persisting State Value for Key in StateStore —  putState


Method

putState(
store: StateStore,
key: UnsafeRow,
state: Any,
timestamp: Long): Unit

putState is part of the StateManager Contract to persist (put) the state value
Note
for the key in the StateStore.

putState …​FIXME

Removing State for Key from StateStore —  removeState


Method

removeState(
store: StateStore,
keyRow: UnsafeRow): Unit


removeState is part of the StateManager Contract to remove the state for the
Note
key from the StateStore.

removeState …​FIXME

Getting All State Data (for All Keys) from StateStore 


—  getAllState Method

getAllState(store: StateStore): Iterator[StateData]

getAllState is part of the StateManager Contract to retrieve all state data (for
Note
all keys) from the StateStore.

getAllState …​FIXME

getStateObject Internal Method

getStateObject(row: UnsafeRow): Any

getStateObject …​FIXME

Note getStateObject is used when…​FIXME

getStateRow Internal Method

getStateRow(obj: Any): UnsafeRow

getStateRow …​FIXME

Note getStateRow is used when…​FIXME

Getting Timeout Timestamp (from State Row) 


—  getTimestamp Internal Method

getTimestamp(stateRow: UnsafeRow): Long

getTimestamp …​FIXME


Note getTimestamp is used when…​FIXME

Setting Timeout Timestamp (to State Row) 


—  setTimestamp Internal Method

setTimestamp(
stateRow: UnsafeRow,
timeoutTimestamps: Long): Unit

setTimestamp …​FIXME

Note setTimestamp is used when…​FIXME

Internal Properties
Name Description

State object serializer (of type Any ⇒ UnsafeRow ) to


serialize a state object (for a per-group state key) to a row
( UnsafeRow )
stateSerializerFunc
The serialization expression (incl. the type) is specified
as the stateSerializerExprs
Used exclusively in getStateRow

State object deserializer (of type InternalRow ⇒ Any ) to


deserialize a row (for a per-group state value) to a Scala
value
stateDeserializerFunc
The deserialization expression (incl. the type) is
specified as the stateDeserializerExpr
Used exclusively in getStateObject

stateDataForGets
Empty StateData to share (reuse) between getState calls
(to avoid high use of memory with many StateData objects)


StateManagerImplV1
StateManagerImplV1 is…​FIXME


FlatMapGroupsWithStateExecHelper
FlatMapGroupsWithStateExecHelper is a utility with the main purpose of creating a

StateManager for FlatMapGroupsWithStateExec physical operator.

Creating StateManager —  createStateManager Method

createStateManager(
stateEncoder: ExpressionEncoder[Any],
shouldStoreTimestamp: Boolean,
stateFormatVersion: Int): StateManager

createStateManager simply creates a StateManager (with the stateEncoder and

shouldStoreTimestamp flag) based on stateFormatVersion :

StateManagerImplV1 for 1

StateManagerImplV2 for 2

createStateManager throws an IllegalArgumentException for stateFormatVersion not 1 or

2 :

Version [stateFormatVersion] is invalid

createStateManager is used exclusively for the StateManager for


Note
FlatMapGroupsWithStateExec physical operator.

InputProcessor Helper Class of FlatMapGroupsWithStateExec Physical Operator
InputProcessor is a helper class to manage state in the state store for every partition of a

FlatMapGroupsWithStateExec physical operator.

InputProcessor is created exclusively when FlatMapGroupsWithStateExec physical operator

is requested to execute and generate a recipe for a distributed computation (as an


RDD[InternalRow]) (and uses InputProcessor in the storeUpdateFunction while processing
rows per partition with a corresponding per-partition state store).

InputProcessor takes a single StateStore to be created. The StateStore manages the per-

group state (and is used when processing new data and timed-out state data, and in the "all
rows processed" callback).

Processing New Data (Creating Iterator of New Data


Processed) —  processNewData Method

processNewData(dataIter: Iterator[InternalRow]): Iterator[InternalRow]

processNewData creates a grouped iterator (of pairs) of per-group state keys and the row
values from the given data iterator ( dataIter ) with the grouping attributes and the output
schema of the child operator (of the parent FlatMapGroupsWithStateExec physical operator).

For every per-group state key (in the grouped iterator), processNewData requests the
StateManager (of the parent FlatMapGroupsWithStateExec physical operator) to get the state
(from the StateStore) and callFunctionAndUpdateState (with the hasTimedOut flag off, i.e.
false ).

processNewData is used exclusively when FlatMapGroupsWithStateExec physical


Note operator is requested to execute and generate a recipe for a distributed
computation (as an RDD[InternalRow]).

Processing Timed-Out State Data (Creating Iterator of


Timed-Out State Data) —  processTimedOutState
Method


processTimedOutState(): Iterator[InternalRow]

processTimedOutState does nothing and simply returns an empty iterator for

GroupStateTimeout.NoTimeout.

With timeout enabled, processTimedOutState gets the current timeout threshold per
GroupStateTimeout:

batchTimestampMs for ProcessingTimeTimeout

eventTimeWatermark for EventTimeTimeout

processTimedOutState creates an iterator of timed-out state data by requesting the

StateManager for all the available state data (in the StateStore) and takes only the state
data with timeout defined and below the current timeout threshold.

In the end, for every timed-out state data, processTimedOutState


callFunctionAndUpdateState (with the hasTimedOut flag enabled).

processTimedOutState is used exclusively when FlatMapGroupsWithStateExec


Note physical operator is requested to execute and generate a recipe for a
distributed computation (as an RDD[InternalRow]).

callFunctionAndUpdateState Internal Method

callFunctionAndUpdateState(
stateData: StateData,
valueRowIter: Iterator[InternalRow],
hasTimedOut: Boolean): Iterator[InternalRow]

callFunctionAndUpdateState is used when InputProcessor is requested to


process new data and timed-out state data.
Note When processing new data, hasTimedOut flag is off ( false ).
When processing timed-out state data, hasTimedOut flag is on ( true ).

callFunctionAndUpdateState creates a key object by requesting the given StateData for the

UnsafeRow of the key (keyRow) and converts it to an object (using the internal state key

converter).

callFunctionAndUpdateState creates value objects by taking every value row (from the given

valueRowIter iterator) and converts them to objects (using the internal state value

converter).


callFunctionAndUpdateState creates a new GroupStateImpl with the following:

The current state value (of the given StateData ) that could possibly be null

The batchTimestampMs of the parent FlatMapGroupsWithStateExec operator (that could


possibly be -1)

The event-time watermark of the parent FlatMapGroupsWithStateExec operator (that


could possibly be -1)

The GroupStateTimeout of the parent FlatMapGroupsWithStateExec operator

The watermarkPresent flag of the parent FlatMapGroupsWithStateExec operator

The given hasTimedOut flag

callFunctionAndUpdateState then executes the user-defined state function (of the parent

FlatMapGroupsWithStateExec operator) on the key object, value objects, and the newly-

created GroupStateImpl .

For every output value from the user-defined state function, callFunctionAndUpdateState
updates numOutputRows performance metric and wraps the values to an internal row (using
the internal output value converter).

In the end, callFunctionAndUpdateState returns a Iterator[InternalRow] which calls the


completion function right after rows have been processed (so the iterator is considered fully
consumed).

"All Rows Processed" Callback —  onIteratorCompletion


Internal Method

onIteratorCompletion: Unit

onIteratorCompletion branches off per whether the GroupStateImpl has been marked

removed and no timeout timestamp is specified or not.

When the GroupStateImpl has been marked removed and no timeout timestamp is
specified, onIteratorCompletion does the following:

1. Requests the StateManager (of the parent FlatMapGroupsWithStateExec operator) to


remove the state (from the StateStore for the key row of the given StateData )

2. Increments the numUpdatedStateRows performance metric


Otherwise, when the GroupStateImpl has not been marked removed or the timeout
timestamp is specified, onIteratorCompletion checks whether the timeout timestamp has
changed by comparing the timeout timestamps of the GroupStateImpl and the given
StateData .

(only when the GroupStateImpl has been updated, removed or the timeout timestamp
changed) onIteratorCompletion does the following:

1. Requests the StateManager (of the parent FlatMapGroupsWithStateExec operator) to


persist the state (in the StateStore with the key row, updated state object, and the
timeout timestamp of the given StateData )

2. Increments the numUpdatedStateRows performance metrics

onIteratorCompletion is used exclusively when InputProcessor is requested


Note
to callFunctionAndUpdateState (right after rows have been processed)

Internal Properties


Name Description

A state key converter (of type InternalRow ⇒ Any ) to


deserialize a given row (for a per-group state key) to the
current state value
The deserialization expression for keys is specified as
the key deserializer expression when the parent
getKeyObj FlatMapGroupsWithStateExec operator is created

The data type of state keys is specified as the grouping


attributes when the parent FlatMapGroupsWithStateExec
operator is created
Used exclusively when InputProcessor is requested to
callFunctionAndUpdateState.

A output value converter (of type Any ⇒ InternalRow ) to


wrap a given output value (from the user-defined state
function) to a row

getOutputRow
The data type of the row is specified as the data type of
the output object attribute when the parent
FlatMapGroupsWithStateExec operator is created

Used exclusively when InputProcessor is requested to


callFunctionAndUpdateState.

A state value converter (of type InternalRow ⇒ Any ) to


deserialize a given row (for a per-group state value) to a
Scala value
The deserialization expression for values is specified as
the value deserializer expression when the parent
getValueObj FlatMapGroupsWithStateExec operator is created

The data type of state values is specified as the data


attributes when the parent FlatMapGroupsWithStateExec
operator is created
Used exclusively when InputProcessor is requested to
callFunctionAndUpdateState.

numOutputRows numOutputRows performance metric

DataStreamReader — Loading Data from Streaming Source
DataStreamReader is the interface to describe how data is loaded to a streaming Dataset

from a streaming source.

Table 1. DataStreamReader’s Methods


Method Description

csv(path: String): DataFrame


csv
Sets csv as the format of the data source

format(source: String): DataStreamReader

format
Specifies the format of the data source
The format is used internally as the name (alias) of the streaming
source to use to load the data

json(path: String): DataFrame


json
Sets json as the format of the data source

load(): DataFrame
load(path: String): DataFrame (1)

load 1. Explicit path (that could also be specified as an option)

Creates a streaming DataFrame that represents "loading"


streaming data (and is internally a logical plan with a
StreamingRelationV2 or StreamingRelation leaf logical operators)

option(key: String, value: Boolean): DataStreamReader


option(key: String, value: Double): DataStreamReader
option(key: String, value: Long): DataStreamReader
option option(key: String, value: String): DataStreamReader

Sets a loading option

options(options: Map[String, String]): DataStreamReader

options
Specifies the configuration options of a data source
Note: You could use the option method if you prefer specifying the options one by one or there is only one in use.

orc(path: String): DataFrame


orc
Sets orc as the format of the data source

parquet(path: String): DataFrame


parquet
Sets parquet as the format of the data source

schema(schema: StructType): DataStreamReader


schema(schemaString: String): DataStreamReader (1)

schema
1. Uses a DDL-formatted table schema
Specifies the user-defined schema of the streaming data source
(as a StructType or DDL-formatted table schema, e.g. a INT, b
STRING )

text(path: String): DataFrame


text
Sets text as the format of the data source

textFile textFile(path: String): Dataset[String]

Figure 1. DataStreamReader and The Others


DataStreamReader is used for a Spark developer to describe how Spark Structured

Streaming loads datasets from a streaming source (that in the end creates a logical plan for
a streaming query).

DataStreamReader is the Spark developer-friendly API to create a


Note StreamingRelation logical operator (that represents a streaming source in a
logical plan).

You can access DataStreamReader using SparkSession.readStream method.


import org.apache.spark.sql.SparkSession
val spark: SparkSession = ...

val streamReader = spark.readStream

DataStreamReader supports many source formats natively and offers the interface to define

custom formats:

json

csv

parquet

text

Note: DataStreamReader assumes the parquet file format by default, which you can change
using the spark.sql.sources.default property.

Note hive source format is not supported.

After you have described the streaming pipeline to read datasets from an external
streaming data source, you eventually trigger the loading using format-agnostic load or
format-specific (e.g. json, csv) operators.

Table 2. DataStreamReader’s Internal Properties (in alphabetical order)

source
  Initial value: spark.sql.sources.default property
  Source format of datasets in a streaming data source

userSpecifiedSchema
  Initial value: (empty)
  Optional user-defined schema

extraOptions
  Initial value: (empty)
  Collection of key-value configuration options

Specifying Loading Options —  option Method

option(key: String, value: String): DataStreamReader


option(key: String, value: Boolean): DataStreamReader
option(key: String, value: Long): DataStreamReader
option(key: String, value: Double): DataStreamReader

option family of methods specifies additional options to a streaming data source.

There is support for values of String , Boolean , Long , and Double types for user
convenience; internally they are converted to the String type.

Internally, option sets extraOptions internal property.

Note: You can also set options in bulk using the options method. You have to do the type conversion yourself, though.
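A minimal sketch that puts format, schema, option and load together; the schema, option values and path are illustrative only.

import org.apache.spark.sql.types.{LongType, StringType, StructType}

// Minimal sketch: describe a streaming load with DataStreamReader.
val schema = new StructType()
  .add("id", LongType)
  .add("name", StringType)

val people = spark.readStream
  .format("csv")
  .schema(schema)
  .option("header", true)
  .option("maxFilesPerTrigger", "1")
  .load("people-dir/")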

Creating Streaming Dataset (to Represent Loading Data


From Streaming Source) —  load Method

load(): DataFrame
load(path: String): DataFrame (1)

1. Specifies path option before passing the call to parameterless load()

load …​FIXME

Built-in Formats

json(path: String): DataFrame


csv(path: String): DataFrame
parquet(path: String): DataFrame
text(path: String): DataFrame
textFile(path: String): Dataset[String] (1)

1. Returns Dataset[String] not DataFrame

DataStreamReader can load streaming datasets from data sources of the following formats:

json

csv

parquet

text

The methods simply pass calls to format followed by load(path).


DataStreamWriter — Writing Datasets To
Streaming Sink
DataStreamWriter is the interface to describe when and what rows of a streaming query are

sent out to the streaming sink.

DataStreamWriter is available using Dataset.writeStream method (on a streaming query).

import org.apache.spark.sql.streaming.DataStreamWriter
import org.apache.spark.sql.Row

val streamingQuery: Dataset[Long] = ...

assert(streamingQuery.isStreaming)

val writer: DataStreamWriter[Row] = streamingQuery.writeStream

Table 1. DataStreamWriter’s Methods


Method Description

foreach(writer: ForeachWriter[T]): DataStreamWriter[T]


foreach

Sets ForeachWriter in the full control of streaming writes

foreachBatch(
  function: (Dataset[T], Long) => Unit): DataStreamWriter[T]

foreachBatch
(New in 2.4.0) Sets the source to foreachBatch and the foreachBatchWriter to the given function (see the sketch after this table).
As per SPARK-24565 Add API for in Structured Streaming for exposing output rows of each microbatch as a DataFrame, the purpose of the method is to expose the micro-batch output as a dataframe for the following:
Pass the output rows of each batch to a library that is designed for the batch jobs only
Reuse batch data sources for output whose streaming version does not exist
Multi-writes where the output rows are written to multiple outputs by writing twice for every batch

format(source: String): DataStreamWriter[T]

format
Specifies the format of the data sink (aka output format)
The format is used internally as the name (alias) of the streaming sink to use to write the data to

option(key: String, value: Boolean): DataStreamWriter[T]


option(key: String, value: Double): DataStreamWriter[T]
option option(key: String, value: Long): DataStreamWriter[T]
option(key: String, value: String): DataStreamWriter[T]

options(options: Map[String, String]): DataStreamWriter[T]

options Specifies the configuration options of a data sink

You could use option method if you prefer specifying


Note
the options one by one or there is only one in use.

outputMode(outputMode: OutputMode): DataStreamWriter[T]


outputMode(outputMode: String): DataStreamWriter[T]
outputMode

Specifies the output mode

partitionBy partitionBy(colNames: String*): DataStreamWriter[T]

queryName(queryName: String): DataStreamWriter[T]


queryName

Assigns the name of a query

start(): StreamingQuery
start(path: String): StreamingQuery (1)

start
1. Explicit path (that could also be specified as an option)

Creates and immediately starts a StreamingQuery

trigger(trigger: Trigger): DataStreamWriter[T]

trigger
Sets the Trigger for how often a streaming query should be
executed and the result saved.
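A minimal sketch of foreachBatch (referenced from the foreachBatch entry above); reusing the batch Parquet writer and the output path are illustrative assumptions.

import org.apache.spark.sql.{DataFrame, Dataset, Row}

// Minimal sketch: every micro-batch arrives as a regular Dataset plus its batch id.
val rates: DataFrame = spark.readStream.format("rate").load

val writeBatch: (Dataset[Row], Long) => Unit = (batch, batchId) =>
  // reuse the batch (non-streaming) Parquet writer for the streaming output
  batch.write.mode("append").parquet("output/rates")

val query = rates.writeStream
  .foreachBatch(writeBatch)
  .start()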


A streaming query is a Dataset with a streaming logical plan.

import org.apache.spark.sql.streaming.Trigger
import scala.concurrent.duration._
import org.apache.spark.sql.DataFrame
val rates: DataFrame = spark.
readStream.
format("rate").
load

scala> rates.isStreaming
res1: Boolean = true

scala> rates.queryExecution.logical.isStreaming
res2: Boolean = true

Like the batch DataFrameWriter , DataStreamWriter has a direct support for many file
formats and an extension point to plug in new formats.

// see above for writer definition

// Save dataset in JSON format


writer.format("json")

In the end, you start the actual continuous writing of the result of executing a Dataset to a
sink using start operator.

writer.save

Beside the above operators, there are the following to work with a Dataset as a whole.

Note: hive is not supported for streaming writing (and leads to an AnalysisException ).

Note DataFrameWriter is responsible for writing in a batch fashion.

Specifying Write Option —  option Method

option(key: String, value: String): DataStreamWriter[T]


option(key: String, value: Boolean): DataStreamWriter[T]
option(key: String, value: Long): DataStreamWriter[T]
option(key: String, value: Double): DataStreamWriter[T]

Internally, option adds the key and value to extraOptions internal option registry.


Specifying Output Mode —  outputMode Method

outputMode(outputMode: String): DataStreamWriter[T]


outputMode(outputMode: OutputMode): DataStreamWriter[T]

outputMode specifies the output mode of a streaming query, i.e. what data is sent out to a

streaming sink when there is new data available in streaming data sources.

Note When not defined explicitly, outputMode defaults to Append output mode.

outputMode can be specified by name or one of the OutputMode values.

Setting Query Name —  queryName method

queryName(queryName: String): DataStreamWriter[T]

queryName sets the name of a streaming query.

Internally, it is just an additional option with the key queryName .

Setting How Often to Execute Streaming Query 


—  trigger method

trigger(trigger: Trigger): DataStreamWriter[T]

trigger method sets the time interval of the trigger (that executes a batch runner) for a

streaming query.

Trigger specifies how often results should be produced by a StreamingQuery.


Note
See Trigger.

The default trigger is ProcessingTime(0L) that runs a streaming query as often as possible.

Tip Consult Trigger to learn about Trigger and ProcessingTime types.
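A minimal sketch of setting a 10-second processing-time trigger on a streaming write; the rate source, console sink and interval are illustrative only.

import org.apache.spark.sql.streaming.Trigger
import scala.concurrent.duration._

// Minimal sketch: execute the streaming query every 10 seconds.
val sq = spark.readStream.format("rate").load
  .writeStream
  .format("console")
  .trigger(Trigger.ProcessingTime(10.seconds))
  .start()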

Creating and Starting Execution of Streaming Query 


—  start Method

start(): StreamingQuery
start(path: String): StreamingQuery (1)


1. Sets path option to path and passes the call on to start()

start starts a streaming query.

start gives a StreamingQuery to control the execution of the continuous query.

Whether or not you have to specify path option depends on the streaming sink
Note
in use.

Internally, start branches off per source .

memory

foreach

other formats

…​FIXME

Table 2. start’s Options

queryName
  Name of active streaming query

checkpointLocation
  Directory for checkpointing (and to store query metadata like offsets before and after being processed, the query id, etc.)

start reports a AnalysisException when source is hive .

val q = spark.
readStream.
text("server-logs/*").
writeStream.
format("hive") <-- hive format used as a streaming sink
scala> q.start
org.apache.spark.sql.AnalysisException: Hive data source can only be used with tables,
you can not write files of Hive data source directly.;
at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:234)
... 48 elided

Note Define options using option or options methods.

Making ForeachWriter in Charge of Streaming Writes 


—  foreach method

foreach(writer: ForeachWriter[T]): DataStreamWriter[T]


foreach sets the input ForeachWriter to be in control of streaming writes.

Internally, foreach sets the streaming output format as foreach and foreachWriter as the
input writer .

foreach uses SparkSession to access SparkContext to clean the


Note
ForeachWriter .

foreach reports an IllegalArgumentException when writer is null .

Note foreach writer cannot be null

Internal Properties
Name / Initial Value / Description

extraOptions

foreachBatchWriter
  foreachBatchWriter: (Dataset[T], Long) => Unit
  Initial value: null
  The function that is used as the batch writer in the ForeachBatchSink for foreachBatch

foreachWriter

partitioningColumns

source

outputMode
  Initial value: Append
  OutputMode of the streaming sink
  Set using outputMode method.

trigger


OutputMode
Output mode ( OutputMode ) of a streaming query describes what data is written to a
streaming sink.

There are three available output modes:

Append

Complete

Update

The output mode is specified on the writing side of a streaming query using
DataStreamWriter.outputMode method (by alias or a value of
org.apache.spark.sql.streaming.OutputMode object).

import org.apache.spark.sql.streaming.OutputMode.Update
val inputStream = spark
.readStream
.format("rate")
.load
.writeStream
.format("console")
.outputMode(Update) // <-- update output mode
.start

Append Output Mode


Append (alias: append) is the default output mode that writes "new" rows only.

In streaming aggregations, a "new" row is when the intermediate state becomes final, i.e.
when new events for the grouping key can only be considered late which is when watermark
moves past the event time of the key.

Append output mode requires that a streaming query defines event-time watermark (using

withWatermark operator) on the event time column that is used in aggregation (directly or
using window function).

Required for datasets with FileFormat format (to create FileStreamSink)

Append is mandatory when multiple flatMapGroupsWithState operators are used in a

structured query.


Complete Output Mode


Complete (alias: complete) writes all the rows of a Result Table (and corresponds to a
traditional batch structured query).

Complete mode does not drop old aggregation state and preserves all data in the Result
Table.

Supported only for streaming aggregations (as asserted by UnsupportedOperationChecker).
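A minimal sketch of a Complete-mode query (running word counts over a socket source); the host and port are illustrative only.

import spark.implicits._

// Minimal sketch: Complete output mode with a streaming aggregation (running counts).
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", "9999")
  .load

val wordCounts = lines.as[String]
  .flatMap(_.split(" "))
  .groupBy("value")
  .count

val query = wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()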

Update Output Mode


Update (alias: update) writes only the rows that were updated (every time there are
updates).

For queries that are not streaming aggregations, Update is equivalent to the Append output
mode.

Trigger — How Frequently to Check Sources For New Data

Trigger defines how often a streaming query should be executed (triggered) and emit new data (which StreamExecution uses to resolve a TriggerExecutor).

Table 1. Trigger’s Factory Methods

ContinuousTrigger
  Trigger.Continuous(long intervalMs)
  Trigger.Continuous(long interval, TimeUnit timeUnit)
  Trigger.Continuous(Duration interval)
  Trigger.Continuous(String interval)

OneTimeTrigger
  Trigger.Once()

ProcessingTime
  Trigger.ProcessingTime(Duration interval)
  Trigger.ProcessingTime(long intervalMs)
  Trigger.ProcessingTime(long interval, TimeUnit timeUnit)
  Trigger.ProcessingTime(String interval)
  See Examples of ProcessingTime below.

You specify the trigger for a streaming query using DataStreamWriter 's trigger
Note
method.


import org.apache.spark.sql.streaming.Trigger
val query = spark.
readStream.
format("rate").
load.
writeStream.
format("console").
option("truncate", false).
trigger(Trigger.Once). // <-- execute once and stop
queryName("rate-once").
start

assert(query.isActive == false)

scala> println(query.lastProgress)
{
"id" : "2ae4b0a4-434f-4ca7-a523-4e859c07175b",
"runId" : "24039ce5-906c-4f90-b6e7-bbb3ec38a1f5",
"name" : "rate-once",
"timestamp" : "2017-07-04T18:39:35.998Z",
"numInputRows" : 0,
"processedRowsPerSecond" : 0.0,
"durationMs" : {
"addBatch" : 1365,
"getBatch" : 29,
"getOffset" : 0,
"queryPlanning" : 285,
"triggerExecution" : 1742,
"walCommit" : 40
},
"stateOperators" : [ ],
"sources" : [ {
"description" : "RateSource[rowsPerSecond=1, rampUpTimeSeconds=0, numPartitions=8]"
,
"startOffset" : null,
"endOffset" : 0,
"numInputRows" : 0,
"processedRowsPerSecond" : 0.0
} ],
"sink" : {
"description" : "org.apache.spark.sql.execution.streaming.ConsoleSink@7dbf277"
}
}

Note Although Trigger allows for custom implementations, StreamExecution refuses such attempts and reports an IllegalStateException.


import org.apache.spark.sql.streaming.Trigger
case object MyTrigger extends Trigger
scala> val sq = spark
.readStream
.format("rate")
.load
.writeStream
.format("console")
.trigger(MyTrigger) // <-- use custom trigger
.queryName("rate-custom-trigger")
.start
java.lang.IllegalStateException: Unknown type of trigger: MyTrigger
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.<init>(MicroBatchExe
cution.scala:60)
at org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryMa
nager.scala:275)
at org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryMan
ager.scala:316)
at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:325)
... 57 elided

Note Trigger was introduced in the commit for [SPARK-14176][SQL] Add DataFrameWriter.trigger to set the stream batch period.

Examples of ProcessingTime
ProcessingTime is a Trigger that assumes that milliseconds is the minimum time unit.

You can create an instance of ProcessingTime using the following constructors:

ProcessingTime(Long) that accepts non-negative values that represent milliseconds.

ProcessingTime(10)

ProcessingTime(interval: String) or ProcessingTime.create(interval: String) that
accept CalendarInterval-formatted strings, with or without the leading interval keyword.

ProcessingTime("10 milliseconds")
ProcessingTime("interval 10 milliseconds")

ProcessingTime(Duration) that accepts scala.concurrent.duration.Duration instances.

ProcessingTime(10.seconds)


ProcessingTime.create(interval: Long, unit: TimeUnit) for Long and

java.util.concurrent.TimeUnit instances.

ProcessingTime.create(10, TimeUnit.SECONDS)
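Put together, a short sketch with the imports the variants above assume (ProcessingTime itself is deprecated since Spark 2.2 in favour of Trigger.ProcessingTime):

import java.util.concurrent.TimeUnit
import scala.concurrent.duration._
import org.apache.spark.sql.streaming.ProcessingTime

ProcessingTime(10)                          // 10 milliseconds
ProcessingTime("10 milliseconds")
ProcessingTime("interval 10 milliseconds")
ProcessingTime(10.seconds)
ProcessingTime.create(10, TimeUnit.SECONDS)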


StreamingQuery Contract
StreamingQuery is the contract of streaming queries that are executed continuously and

concurrently (i.e. on a separate thread).

Note StreamingQuery is called continuous query or streaming query.

Note StreamingQuery is a Scala trait with the only implementation being StreamExecution (and, less importantly, StreamingQueryWrapper for serializing a non-serializable StreamExecution).

Table 1. StreamingQuery Contract


Method Description

awaitTermination(): Unit
awaitTermination(timeoutMs: Long): Boolean
awaitTermination

Used when…​FIXME

exception: Option[StreamingQueryException]

exception
StreamingQueryException if the query has finished due to
an exception
Used when…​FIXME

explain(): Unit
explain(extended: Boolean): Unit
explain

Used when…​FIXME

id: UUID

id
The unique identifier of the streaming query (that does
not change across restarts unlike runId)
Used when…​FIXME

isActive: Boolean

isActive Indicates whether the streaming query is active ( true )


or not ( false )


Used when…​FIXME

lastProgress: StreamingQueryProgress

lastProgress
The last StreamingQueryProgress of the streaming query
Used when…​FIXME

name: String

name
The name of the query that is unique across all active
queries
Used when…​FIXME

processAllAvailable(): Unit

processAllAvailable Pauses (blocks) the current thread until the streaming


query has no more data to be processed or has been
stopped
Intended for testing

recentProgress: Array[StreamingQueryProgress]

recentProgress
Collection of the recent StreamingQueryProgress
updates.
Used when…​FIXME

runId: UUID

runId
The unique identifier of the current execution of the
streaming query (that is different every restart unlike id)
Used when…​FIXME

sparkSession: SparkSession
sparkSession

Used when…​FIXME

status: StreamingQueryStatus


status StreamingQueryStatus of the streaming query (as


StreamExecution has accumulated being a
ProgressReporter while running the streaming query)

Used when…​FIXME

stop(): Unit
stop

Stops the streaming query

StreamingQuery can be in two states:

active (started)

inactive (stopped)

If inactive, StreamingQuery may have transitioned into the state due to a StreamingQueryException (that is available under exception).

StreamingQuery tracks the current state of all the sources, i.e. SourceStatus, as sourceStatuses.

There could only be a single Sink for a StreamingQuery with many Sources.

StreamingQuery can be stopped by stop or an exception.
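A short sketch of how the contract is typically exercised (the rate-to-console query is just for illustration):

val sq = spark
  .readStream
  .format("rate")
  .load
  .writeStream
  .format("console")
  .queryName("rate-to-console")
  .start

assert(sq.isActive)
println(sq.name)          // rate-to-console
println(sq.lastProgress)  // may be null until the first progress update

sq.stop
assert(sq.isActive == false)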


Streaming Operators — High-Level Declarative Streaming Dataset API

Dataset API comes with a set of operators that are of particular use in Spark Structured Streaming and that together constitute the so-called High-Level Declarative Streaming Dataset API.

Table 1. Streaming Operators

crossJoin
  crossJoin(right: Dataset[_]): DataFrame

dropDuplicates
  dropDuplicates(): Dataset[T]
  dropDuplicates(colNames: Seq[String]): Dataset[T]
  dropDuplicates(col1: String, cols: String*): Dataset[T]
  Drops duplicate records (given a subset of columns)

explain
  explain(): Unit
  explain(extended: Boolean): Unit
  Explains query plans

groupBy
  groupBy(cols: Column*): RelationalGroupedDataset
  groupBy(col1: String, cols: String*): RelationalGroupedDataset
  Aggregates rows by zero, one or more columns

groupByKey
  groupByKey(func: T => K): KeyValueGroupedDataset[K, T]
  Aggregates rows by a typed grouping function (and gives a KeyValueGroupedDataset)

join
  join(right: Dataset[_]): DataFrame
  join(right: Dataset[_], joinExprs: Column): DataFrame
  join(right: Dataset[_], joinExprs: Column, joinType: String): DataFrame
  join(right: Dataset[_], usingColumns: Seq[String]): DataFrame
  join(right: Dataset[_], usingColumns: Seq[String], joinType: String): DataFrame
  join(right: Dataset[_], usingColumn: String): DataFrame

joinWith
  joinWith[U](other: Dataset[U], condition: Column): Dataset[(T, U)]
  joinWith[U](other: Dataset[U], condition: Column, joinType: String): Dataset[(T, U)]

withWatermark
  withWatermark(eventTime: String, delayThreshold: String): Dataset[T]
  Defines a streaming watermark (on the given eventTime column with a delay threshold)

writeStream
  writeStream: DataStreamWriter[T]
  Creates a DataStreamWriter for persisting the result of a streaming query to an external data system


val rates = spark


.readStream
.format("rate")
.option("rowsPerSecond", 1)
.load

// stream processing
// replace [operator] with the operator of your choice
rates.[operator]

// output stream
import org.apache.spark.sql.streaming.{OutputMode, Trigger}
import scala.concurrent.duration._
val sq = rates
.writeStream
.format("console")
.option("truncate", false)
.trigger(Trigger.ProcessingTime(10.seconds))
.outputMode(OutputMode.Complete)
.queryName("rate-console")
.start

// eventually...
sq.stop


dropDuplicates Operator — Streaming
Deduplication
dropDuplicates(): Dataset[T]
dropDuplicates(colNames: Seq[String]): Dataset[T]
dropDuplicates(col1: String, cols: String*): Dataset[T]

dropDuplicates operator…​FIXME

Note For a streaming Dataset, dropDuplicates will keep all data across triggers as intermediate state to drop duplicate rows. You can use the withWatermark operator to limit how late the duplicate data can be and the system will accordingly limit the state. In addition, data older than the watermark will be dropped to avoid any possibility of duplicates.

scala> spark.version
res0: String = 2.3.0-SNAPSHOT

// Start a streaming query


// Using old-fashioned MemoryStream (with the deprecated SQLContext)
import org.apache.spark.sql.execution.streaming.MemoryStream
import org.apache.spark.sql.SQLContext
implicit val sqlContext: SQLContext = spark.sqlContext
val source = MemoryStream[(Int, Int)]
val ids = source.toDS.toDF("time", "id").
  withColumn("time", $"time" cast "timestamp"). // <-- convert time column from Int to Timestamp
  dropDuplicates("id").
  withColumn("time", $"time" cast "long")       // <-- convert time column back from Timestamp to Int

// Conversions are only for display purposes


// Internally we need timestamps for watermark to work
// Displaying timestamps could be too much for such a simple task

scala> println(ids.queryExecution.analyzed.numberedTreeString)
00 Project [cast(time#10 as bigint) AS time#15L, id#6]
01 +- Deduplicate [id#6], true
02 +- Project [cast(time#5 as timestamp) AS time#10, id#6]
03 +- Project [_1#2 AS time#5, _2#3 AS id#6]
04 +- StreamingExecutionRelation MemoryStream[_1#2,_2#3], [_1#2, _2#3]

import org.apache.spark.sql.streaming.{OutputMode, Trigger}


import scala.concurrent.duration._
val q = ids.
  writeStream.
  format("memory").
  queryName("dups").
  outputMode(OutputMode.Append).
  trigger(Trigger.ProcessingTime(30.seconds)).
  option("checkpointLocation", "checkpoint-dir"). // <-- use checkpointing to save state between restarts
  start

// Publish duplicate records


source.addData(1 -> 1)
source.addData(2 -> 1)
source.addData(3 -> 1)

q.processAllAvailable()

// Check out how dropDuplicates removes duplicates


// --> per single streaming batch (easy)
scala> spark.table("dups").show
+----+---+
|time| id|
+----+---+
| 1| 1|
+----+---+

source.addData(4 -> 1)
source.addData(5 -> 2)

// --> across streaming batches (harder)


scala> spark.table("dups").show
+----+---+
|time| id|
+----+---+
| 1| 1|
| 5| 2|
+----+---+

// Check out the internal state


scala> println(q.lastProgress.stateOperators(0).prettyJson)
{
"numRowsTotal" : 2,
"numRowsUpdated" : 1,
"memoryUsedBytes" : 17751
}

// You could use web UI's SQL tab instead


// Use Details for Query

source.addData(6 -> 2)

scala> spark.table("dups").show
+----+---+
|time| id|
+----+---+
| 1| 1|
| 5| 2|
+----+---+

// Check out the internal state


scala> println(q.lastProgress.stateOperators(0).prettyJson)
{
"numRowsTotal" : 2,
"numRowsUpdated" : 0,
"memoryUsedBytes" : 17751
}

// Restart the streaming query


q.stop

val q = ids.
  writeStream.
  format("memory").
  queryName("dups").
  outputMode(OutputMode.Complete). // <-- memory sink supports checkpointing for Complete output mode only
  trigger(Trigger.ProcessingTime(30.seconds)).
  option("checkpointLocation", "checkpoint-dir"). // <-- use checkpointing to save state between restarts
  start

// Doh! MemorySink is fine, but Complete is only available with a streaming aggregation
// Answer it if you know why --> https://stackoverflow.com/q/45756997/1305344

// It's high time to work on https://issues.apache.org/jira/browse/SPARK-21667
// to understand the low-level details (and the reason, it seems)

// Disabling operation checks and starting over


// ./bin/spark-shell -c spark.sql.streaming.unsupportedOperationCheck=false
// it works now --> no exception!

scala> spark.table("dups").show
+----+---+
|time| id|
+----+---+
+----+---+

source.addData(0 -> 1)
// wait till the batch is triggered
scala> spark.table("dups").show
+----+---+
|time| id|
+----+---+
| 0| 1|
+----+---+

source.addData(1 -> 1)
source.addData(2 -> 1)


// wait till the batch is triggered


scala> spark.table("dups").show
+----+---+
|time| id|
+----+---+
+----+---+

// What?! No rows?! It doesn't look as if it worked fine :(

// Use groupBy to pass the requirement of having streaming aggregation for Complete output mode
val counts = ids.groupBy("id").agg(first($"time") as "first_time")
scala> counts.explain
== Physical Plan ==
*HashAggregate(keys=[id#246], functions=[first(time#255L, false)])
+- StateStoreSave [id#246], StatefulOperatorStateInfo(<unknown>,3585583b-42d7-4547-8d62
-255581c48275,0,0), Append, 0
+- *HashAggregate(keys=[id#246], functions=[merge_first(time#255L, false)])
+- StateStoreRestore [id#246], StatefulOperatorStateInfo(<unknown>,3585583b-42d7
-4547-8d62-255581c48275,0,0)
+- *HashAggregate(keys=[id#246], functions=[merge_first(time#255L, false)])
+- *HashAggregate(keys=[id#246], functions=[partial_first(time#255L, false
)])
+- *Project [cast(time#250 as bigint) AS time#255L, id#246]
+- StreamingDeduplicate [id#246], StatefulOperatorStateInfo(<unknown
>,3585583b-42d7-4547-8d62-255581c48275,1,0), 0
+- Exchange hashpartitioning(id#246, 200)
+- *Project [cast(_1#242 as timestamp) AS time#250, _2#243 AS
id#246]
+- StreamingRelation MemoryStream[_1#242,_2#243], [_1#242,
_2#243]
val q = counts.
  writeStream.
  format("memory").
  queryName("dups").
  outputMode(OutputMode.Complete). // <-- memory sink supports checkpointing for Complete output mode only
  trigger(Trigger.ProcessingTime(30.seconds)).
  option("checkpointLocation", "checkpoint-dir"). // <-- use checkpointing to save state between restarts
  start

source.addData(0 -> 1)
source.addData(1 -> 1)
// wait till the batch is triggered
scala> spark.table("dups").show
+---+----------+
| id|first_time|
+---+----------+
| 1| 0|
+---+----------+

// Publish duplicates


// Check out how dropDuplicates removes duplicates

// Stop the streaming query


// Specify event time watermark to remove old duplicates


Dataset.explain High-Level Operator — Explaining Streaming Query Plans
explain(): Unit (1)
explain(extended: Boolean): Unit

1. Calls explain with extended flag disabled

Dataset.explain is a high-level operator that prints the logical and (with extended flag

enabled) physical plans to the console.

val records = spark.


readStream.
format("rate").
load
scala> records.explain
== Physical Plan ==
StreamingRelation rate, [timestamp#0, value#1L]

scala> records.explain(extended = true)


== Parsed Logical Plan ==
StreamingRelation DataSource(org.apache.spark.sql.SparkSession@4071aa13,rate,List(),No
ne,List(),None,Map(),None), rate, [timestamp#0, value#1L]

== Analyzed Logical Plan ==


timestamp: timestamp, value: bigint
StreamingRelation DataSource(org.apache.spark.sql.SparkSession@4071aa13,rate,List(),No
ne,List(),None,Map(),None), rate, [timestamp#0, value#1L]

== Optimized Logical Plan ==


StreamingRelation DataSource(org.apache.spark.sql.SparkSession@4071aa13,rate,List(),No
ne,List(),None,Map(),None), rate, [timestamp#0, value#1L]

== Physical Plan ==
StreamingRelation rate, [timestamp#0, value#1L]

Internally, explain creates a ExplainCommand runnable command with the logical plan and
extended flag.

explain then executes the plan with ExplainCommand runnable command and collects the

results that are printed out to the standard output.


Note explain uses SparkSession to access the current SessionState to execute the plan.

  import org.apache.spark.sql.execution.command.ExplainCommand
  val explain = ExplainCommand(...)
  spark.sessionState.executePlan(explain)

For streaming Datasets, ExplainCommand command simply creates an IncrementalExecution for the SparkSession and the logical plan.

Note For the purpose of explain, IncrementalExecution is created with the output mode Append, checkpoint location <unknown>, run id a random number, current batch id 0 and offset metadata empty. They do not really matter when explaining the load-part of a streaming query.


groupBy Operator — Untyped Streaming Aggregation (with Implicit State Logic)
groupBy(cols: Column*): RelationalGroupedDataset
groupBy(col1: String, cols: String*): RelationalGroupedDataset

groupBy operator…​FIXME

scala> spark.version
res0: String = 2.3.0-SNAPSHOT

// Since I'm with SNAPSHOT
// Remember to remove ~/.ivy2/cache/org.apache.spark
// Make sure that ~/.ivy2/jars/org.apache.spark_spark-sql-kafka-0-10_2.11-2.3.0-SNAPSHOT.jar is the latest
// Start spark-shell as follows
/**
./bin/spark-shell --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.0-SNAPSHOT
*/

val fromTopic1 = spark.


readStream.
format("kafka").
option("subscribe", "topic1").
option("kafka.bootstrap.servers", "localhost:9092").
load

// extract event time et al


// time,key,value
/*
2017-08-23T00:00:00.002Z,1,now
2017-08-23T00:05:00.002Z,1,5 mins later
2017-08-23T00:09:00.002Z,1,9 mins later
2017-08-23T00:11:00.002Z,1,11 mins later
2017-08-23T01:00:00.002Z,1,1 hour later
// late event = watermark should be (1 hour - 10 minutes) already
2017-08-23T00:49:59.002Z,1,==> SHOULD NOT BE INCLUDED in aggregation as too late <==

CAUTION: FIXME SHOULD NOT BE INCLUDED is included contrary to my understanding?!


*/
val timedValues = fromTopic1.
select('value cast "string").
withColumn("tokens", split('value, ",")).
withColumn("time", to_timestamp('tokens(0))).
withColumn("key", 'tokens(1) cast "int").
withColumn("value", 'tokens(2)).
select("time", "key", "value")


// aggregation with watermark


val counts = timedValues.
withWatermark("time", "10 minutes").
groupBy("key").
agg(collect_list('value) as "values", collect_list('time) as "times")

// Note that StatefulOperatorStateInfo is mostly generic


// since no batch-specific values are currently available
// only after the first streaming batch
scala> counts.explain
== Physical Plan ==
ObjectHashAggregate(keys=[key#27], functions=[collect_list(value#33, 0, 0), collect_li
st(time#22-T600000ms, 0, 0)])
+- Exchange hashpartitioning(key#27, 200)
+- StateStoreSave [key#27], StatefulOperatorStateInfo(<unknown>,25149816-1f14-4901-
af13-896286a26d42,0,0), Append, 0
+- ObjectHashAggregate(keys=[key#27], functions=[merge_collect_list(value#33, 0,
0), merge_collect_list(time#22-T600000ms, 0, 0)])
+- Exchange hashpartitioning(key#27, 200)
+- StateStoreRestore [key#27], StatefulOperatorStateInfo(<unknown>,25149816
-1f14-4901-af13-896286a26d42,0,0)
+- ObjectHashAggregate(keys=[key#27], functions=[merge_collect_list(val
ue#33, 0, 0), merge_collect_list(time#22-T600000ms, 0, 0)])
+- Exchange hashpartitioning(key#27, 200)
+- ObjectHashAggregate(keys=[key#27], functions=[partial_collect_
list(value#33, 0, 0), partial_collect_list(time#22-T600000ms, 0, 0)])
+- EventTimeWatermark time#22: timestamp, interval 10 minutes
+- *Project [cast(split(cast(value#1 as string), ,)[0] as t
imestamp) AS time#22, cast(split(cast(value#1 as string), ,)[1] as int) AS key#27, spl
it(cast(value#1 as string), ,)[2] AS value#33]
+- StreamingRelation kafka, [key#0, value#1, topic#2, pa
rtition#3, offset#4L, timestamp#5, timestampType#6]

import org.apache.spark.sql.streaming._
import scala.concurrent.duration._
val sq = counts.writeStream.
format("console").
option("truncate", false).
trigger(Trigger.ProcessingTime(30.seconds)).
outputMode(OutputMode.Update). // <-- only Update or Complete acceptable because of
groupBy aggregation
start

// After StreamingQuery was started,


// the physical plan is complete (with batch-specific values)
scala> sq.explain
== Physical Plan ==
ObjectHashAggregate(keys=[key#27], functions=[collect_list(value#33, 0, 0), collect_li
st(time#22-T600000ms, 0, 0)])
+- Exchange hashpartitioning(key#27, 200)
+- StateStoreSave [key#27], StatefulOperatorStateInfo(file:/private/var/folders/0w/
kb0d3rqn4zb9fcc91pxhgn8w0000gn/T/temporary-635d6519-b6ca-4686-9b6b-5db0e83cfd51/state,
855cec1c-25dc-4a86-ae54-c6cdd4ed02ec,0,0), Update, 0
+- ObjectHashAggregate(keys=[key#27], functions=[merge_collect_list(value#33, 0,
0), merge_collect_list(time#22-T600000ms, 0, 0)])
+- Exchange hashpartitioning(key#27, 200)
+- StateStoreRestore [key#27], StatefulOperatorStateInfo(file:/private/var
/folders/0w/kb0d3rqn4zb9fcc91pxhgn8w0000gn/T/temporary-635d6519-b6ca-4686-9b6b-5db0e83
cfd51/state,855cec1c-25dc-4a86-ae54-c6cdd4ed02ec,0,0)
+- ObjectHashAggregate(keys=[key#27], functions=[merge_collect_list(val
ue#33, 0, 0), merge_collect_list(time#22-T600000ms, 0, 0)])
+- Exchange hashpartitioning(key#27, 200)
+- ObjectHashAggregate(keys=[key#27], functions=[partial_collect_
list(value#33, 0, 0), partial_collect_list(time#22-T600000ms, 0, 0)])
+- EventTimeWatermark time#22: timestamp, interval 10 minutes
+- *Project [cast(split(cast(value#76 as string), ,)[0] as
timestamp) AS time#22, cast(split(cast(value#76 as string), ,)[1] as int) AS key#27, s
plit(cast(value#76 as string), ,)[2] AS value#33]
+- Scan ExistingRDD[key#75,value#76,topic#77,partition#78
,offset#79L,timestamp#80,timestampType#81]


groupByKey Operator — Streaming
Aggregation
Introduction

Example: Aggregating Orders Per Zip Code

Example: Aggregating Metrics Per Device

Introduction

groupByKey[K: Encoder](func: T => K): KeyValueGroupedDataset[K, T]

groupByKey operator creates a KeyValueGroupedDataset (with keys of type K and rows of

type T ) to apply aggregation functions over groups of rows (of type T ) by key (of type K )
per the given func key-generating function.

Note The type of the input argument of func is the type of rows in the Dataset (i.e. Dataset[T]).

groupByKey simply applies the func function to every row (of type T ) and associates it

with a logical group per key (of type K ).

func: T => K

Internally, groupByKey creates a structured query with the AppendColumns unary logical
operator (with the given func and the analyzed logical plan of the target Dataset that
groupByKey was executed on) and creates a new QueryExecution .

In the end, groupByKey creates a KeyValueGroupedDataset with the following:

Encoders for K keys and T rows

The new QueryExecution (with the AppendColumns unary logical operator)

The output schema of the analyzed logical plan

The new columns of the AppendColumns logical operator (i.e. the attributes of the key)


scala> :type sq
org.apache.spark.sql.Dataset[Long]

val baseCode = 'A'.toInt


val byUpperChar = (n: java.lang.Long) => (n % 3 + baseCode).toString
val kvs = sq.groupByKey(byUpperChar)

scala> :type kvs


org.apache.spark.sql.KeyValueGroupedDataset[String,Long]

// Peeking under the surface of KeyValueGroupedDataset


import org.apache.spark.sql.catalyst.plans.logical.AppendColumns
val appendColumnsOp = kvs.queryExecution.analyzed.collect { case ac: AppendColumns =>
ac }.head
scala> println(appendColumnsOp.newColumns)
List(value#7)

Example: Aggregating Orders Per Zip Code


Go to Demo: groupByKey Streaming Aggregation in Update Mode.

Example: Aggregating Metrics Per Device


The following example code shows how to apply groupByKey operator to a structured
stream of timestamped values of different devices.

// input stream
import java.sql.Timestamp
val signals = spark.
  readStream.
  format("rate").
  option("rowsPerSecond", 1).
  load.
  withColumn("value", $"value" % 10).                   // <-- randomize the values (just for fun)
  withColumn("deviceId", lit(util.Random.nextInt(10))). // <-- 10 devices randomly assigned to values
  as[(Timestamp, Long, Int)] // <-- convert to a "better" type (from "unpleasant" Row)

// stream processing using groupByKey operator
// groupByKey(func: ((Timestamp, Long, Int)) => K): KeyValueGroupedDataset[K, (Timestamp, Long, Int)]
// K becomes Int which is a device id
val deviceId: ((Timestamp, Long, Int)) => Int = { case (_, _, deviceId) => deviceId }
scala> val signalsByDevice = signals.groupByKey(deviceId)
signalsByDevice: org.apache.spark.sql.KeyValueGroupedDataset[Int,(java.sql.Timestamp,
Long, Int)] = org.apache.spark.sql.KeyValueGroupedDataset@19d40bc6


withWatermark Operator — Event-Time
Watermark
withWatermark(eventTime: String, delayThreshold: String): Dataset[T]

withWatermark specifies the eventTime column for event time watermark and

delayThreshold for event lateness.

eventTime specifies the column to use for the watermark and can be either part of the Dataset from the source or custom-generated using functions like current_timestamp.

Note A watermark tracks a point in time before which it is assumed no more late events are supposed to arrive (and if they do, the late events are considered really late and simply dropped).

Note Spark Structured Streaming uses watermark for the following:

  To know when a given time window aggregation (using groupBy operator with window function) can be finalized and thus emitted when using output modes that do not allow updates, like Append output mode.

  To minimize the amount of state that we need to keep for ongoing aggregations, e.g. mapGroupsWithState (for implicit state management), flatMapGroupsWithState (for user-defined state management) and dropDuplicates operators.

The current watermark is computed by looking at the maximum eventTime seen across all
of the partitions in a query minus a user-specified delayThreshold . Due to the cost of
coordinating this value across partitions, the actual watermark used is only guaranteed to be
at least delayThreshold behind the actual event time.

Note In some cases Spark may still process records that arrive more than delayThreshold late.
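A minimal sketch of withWatermark limiting the state of streaming deduplication (the rate source's timestamp column serves as the event time; the column names are the rate source's own):

// use the watermark to bound the state kept by dropDuplicates
val deduped = spark
  .readStream
  .format("rate")
  .load                                      // <-- timestamp, value columns
  .withWatermark("timestamp", "10 minutes")  // <-- events older than the watermark are dropped
  .dropDuplicates("value", "timestamp")      // <-- state per (value, timestamp) is bounded by the watermark

val sq = deduped
  .writeStream
  .format("console")
  .start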


window Function — Stream Time Windows


window is a standard function that generates tumbling, sliding or delayed stream time

window ranges (on a timestamp column).

window(
timeColumn: Column,
windowDuration: String): Column (1)
window(
timeColumn: Column,
windowDuration: String,
slideDuration: String): Column (2)
window(
timeColumn: Column,
windowDuration: String,
slideDuration: String,
startTime: String): Column (3)

1. Creates a tumbling time window with slideDuration as windowDuration and 0 second for startTime

2. Creates a sliding time window with 0 second for startTime

3. Creates a delayed time window

Note From Tumbling Window (Azure Stream Analytics): Tumbling windows are a series of fixed-sized, non-overlapping and contiguous time intervals.

Note From Introducing Stream Windows in Apache Flink: Tumbling windows group elements of a stream into finite sets where each set corresponds to an interval. Tumbling windows discretize a stream into non-overlapping windows.

scala> val timeColumn = window($"time", "5 seconds")


timeColumn: org.apache.spark.sql.Column = timewindow(time, 5000000, 5000000, 0) AS `wi
ndow`

timeColumn should be of TimestampType , i.e. with java.sql.Timestamp values.

Tip Use java.sql.Timestamp.from or java.sql.Timestamp.valueOf factory methods to create Timestamp instances.


// https://docs.oracle.com/javase/8/docs/api/java/time/LocalDateTime.html
import java.time.LocalDateTime
// https://docs.oracle.com/javase/8/docs/api/java/sql/Timestamp.html
import java.sql.Timestamp
val levels = Seq(
// (year, month, dayOfMonth, hour, minute, second)
((2012, 12, 12, 12, 12, 12), 5),
((2012, 12, 12, 12, 12, 14), 9),
((2012, 12, 12, 13, 13, 14), 4),
((2016, 8, 13, 0, 0, 0), 10),
((2017, 5, 27, 0, 0, 0), 15)).
map { case ((yy, mm, dd, h, m, s), a) => (LocalDateTime.of(yy, mm, dd, h, m, s), a)
}.
map { case (ts, a) => (Timestamp.valueOf(ts), a) }.
toDF("time", "level")
scala> levels.show
+-------------------+-----+
| time|level|
+-------------------+-----+
|2012-12-12 12:12:12| 5|
|2012-12-12 12:12:14| 9|
|2012-12-12 13:13:14| 4|
|2016-08-13 00:00:00| 10|
|2017-05-27 00:00:00| 15|
+-------------------+-----+

val q = levels.select(window($"time", "5 seconds"), $"level")


scala> q.show(truncate = false)
+---------------------------------------------+-----+
|window |level|
+---------------------------------------------+-----+
|[2012-12-12 12:12:10.0,2012-12-12 12:12:15.0]|5 |
|[2012-12-12 12:12:10.0,2012-12-12 12:12:15.0]|9 |
|[2012-12-12 13:13:10.0,2012-12-12 13:13:15.0]|4 |
|[2016-08-13 00:00:00.0,2016-08-13 00:00:05.0]|10 |
|[2017-05-27 00:00:00.0,2017-05-27 00:00:05.0]|15 |
+---------------------------------------------+-----+

scala> q.printSchema
root
|-- window: struct (nullable = true)
| |-- start: timestamp (nullable = true)
| |-- end: timestamp (nullable = true)
|-- level: integer (nullable = false)

// calculating the sum of levels every 5 seconds


val sums = levels.
groupBy(window($"time", "5 seconds")).
agg(sum("level") as "level_sum").
select("window.start", "window.end", "level_sum")
scala> sums.show
+-------------------+-------------------+---------+
| start| end|level_sum|
+-------------------+-------------------+---------+
|2012-12-12 13:13:10|2012-12-12 13:13:15| 4|
|2012-12-12 12:12:10|2012-12-12 12:12:15| 14|
|2016-08-13 00:00:00|2016-08-13 00:00:05| 10|
|2017-05-27 00:00:00|2017-05-27 00:00:05| 15|
+-------------------+-------------------+---------+

windowDuration and slideDuration are strings specifying the width of the window for

duration and sliding identifiers, respectively.

Tip Use CalendarInterval for valid window identifiers.

There are a couple of rules governing the durations:

1. The window duration must be greater than 0

2. The slide duration must be greater than 0.

3. The start time must be greater than or equal to 0.

4. The slide duration must be less than or equal to the window duration.

5. The start time must be less than the slide duration.

Note Only one window expression is supported in a query.

Note null values are filtered out in window expression.
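As a short sketch, a sliding variant of the earlier query (a 10-second window sliding every 5 seconds over the levels dataset above; assumes import org.apache.spark.sql.functions._ as in the examples above):

// sliding windows: a row can contribute to more than one window
val slidingSums = levels.
  groupBy(window($"time", "10 seconds", "5 seconds")). // <-- windowDuration, slideDuration
  agg(sum("level") as "level_sum").
  select("window.start", "window.end", "level_sum")

slidingSums.show(truncate = false)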

Internally, window creates a Column with TimeWindow Catalyst expression under window
alias.

scala> val timeColumn = window($"time", "5 seconds")


timeColumn: org.apache.spark.sql.Column = timewindow(time, 5000000, 5000000, 0) AS `wi
ndow`

val windowExpr = timeColumn.expr


scala> println(windowExpr.numberedTreeString)
00 timewindow('time, 5000000, 5000000, 0) AS window#23
01 +- timewindow('time, 5000000, 5000000, 0)
02 +- 'time

Internally, TimeWindow Catalyst expression is simply a struct type with two fields, i.e. start
and end , both of TimestampType type.


scala> println(windowExpr.dataType)
StructType(StructField(start,TimestampType,true), StructField(end,TimestampType,true))

scala> println(windowExpr.dataType.prettyJson)
{
"type" : "struct",
"fields" : [ {
"name" : "start",
"type" : "timestamp",
"nullable" : true,
"metadata" : { }
}, {
"name" : "end",
"type" : "timestamp",
"nullable" : true,
"metadata" : { }
} ]
}

Note TimeWindow time window Catalyst expression is planned (i.e. converted) in TimeWindowing logical optimization rule (i.e. Rule[LogicalPlan]) of the Spark SQL logical query plan analyzer. Find more about the Spark SQL logical query plan analyzer in the Mastering Apache Spark 2 gitbook.

Example — Traffic Sensor
Note The example is borrowed from Introducing Stream Windows in Apache Flink.

The example shows how to use window function to model a traffic sensor that counts every
15 seconds the number of vehicles passing a certain location.


KeyValueGroupedDataset — Streaming
Aggregation
KeyValueGroupedDataset represents a grouped dataset as a result of Dataset.groupByKey

operator (that aggregates records by a grouping function).

// Dataset[T]
groupByKey(func: T => K): KeyValueGroupedDataset[K, T]

import java.sql.Timestamp
val numGroups = spark.
readStream.
format("rate").
load.
as[(Timestamp, Long)].
groupByKey { case (time, value) => value % 2 }

scala> :type numGroups


org.apache.spark.sql.KeyValueGroupedDataset[Long,(java.sql.Timestamp, Long)]

KeyValueGroupedDataset is also created for KeyValueGroupedDataset.keyAs and

KeyValueGroupedDataset.mapValues operators.

scala> :type numGroups


org.apache.spark.sql.KeyValueGroupedDataset[Long,(java.sql.Timestamp, Long)]

scala> :type numGroups.keyAs[String]


org.apache.spark.sql.KeyValueGroupedDataset[String,(java.sql.Timestamp, Long)]

scala> :type numGroups


org.apache.spark.sql.KeyValueGroupedDataset[Long,(java.sql.Timestamp, Long)]

val mapped = numGroups.mapValues { case (ts, n) => s"($ts, $n)" }


scala> :type mapped
org.apache.spark.sql.KeyValueGroupedDataset[Long,String]

KeyValueGroupedDataset works for batch and streaming aggregations, but shines the most

when used for Streaming Aggregation.


scala> :type numGroups


org.apache.spark.sql.KeyValueGroupedDataset[Long,(java.sql.Timestamp, Long)]

import org.apache.spark.sql.streaming.Trigger
import scala.concurrent.duration._
numGroups.
mapGroups { case(group, values) => values.size }.
writeStream.
format("console").
trigger(Trigger.ProcessingTime(10.seconds)).
start

-------------------------------------------
Batch: 0
-------------------------------------------
+-----+
|value|
+-----+
+-----+

-------------------------------------------
Batch: 1
-------------------------------------------
+-----+
|value|
+-----+
| 3|
| 2|
+-----+

-------------------------------------------
Batch: 2
-------------------------------------------
+-----+
|value|
+-----+
| 5|
| 5|
+-----+

// Eventually...
spark.streams.active.foreach(_.stop)

The most prestigious use case of KeyValueGroupedDataset however is Arbitrary Stateful


Streaming Aggregation that allows for accumulating streaming state (by means of
GroupState) using mapGroupsWithState and the more advanced flatMapGroupsWithState
operators.

Table 1. KeyValueGroupedDataset’s Operators

agg
  agg[U1](col1: TypedColumn[V, U1]): Dataset[(K, U1)]
  agg[U1, U2](
    col1: TypedColumn[V, U1],
    col2: TypedColumn[V, U2]): Dataset[(K, U1, U2)]
  agg[U1, U2, U3](
    col1: TypedColumn[V, U1],
    col2: TypedColumn[V, U2],
    col3: TypedColumn[V, U3]): Dataset[(K, U1, U2, U3)]
  agg[U1, U2, U3, U4](
    col1: TypedColumn[V, U1],
    col2: TypedColumn[V, U2],
    col3: TypedColumn[V, U3],
    col4: TypedColumn[V, U4]): Dataset[(K, U1, U2, U3, U4)]

cogroup
  cogroup[U, R : Encoder](
    other: KeyValueGroupedDataset[K, U])(
    f: (K, Iterator[V], Iterator[U]) => TraversableOnce[R]): Dataset[R]

count
  count(): Dataset[(K, Long)]

flatMapGroups
  flatMapGroups[U : Encoder](f: (K, Iterator[V]) => TraversableOnce[U]): Dataset[U]

flatMapGroupsWithState
  flatMapGroupsWithState[S: Encoder, U: Encoder](
    outputMode: OutputMode,
    timeoutConf: GroupStateTimeout)(
    func: (K, Iterator[V], GroupState[S]) => Iterator[U]): Dataset[U]

  Arbitrary Stateful Streaming Aggregation - streaming aggregation with explicit state logic and state timeout

  Note The difference between this flatMapGroupsWithState and the mapGroupsWithState operators is the state function that generates zero or more elements (that are in turn the rows in the result streaming Dataset).

keyAs
  keys: Dataset[K]
  keyAs[L : Encoder]: KeyValueGroupedDataset[L, V]

mapGroups
  mapGroups[U : Encoder](f: (K, Iterator[V]) => U): Dataset[U]

mapGroupsWithState
  mapGroupsWithState[S: Encoder, U: Encoder](
    func: (K, Iterator[V], GroupState[S]) => U): Dataset[U]
  mapGroupsWithState[S: Encoder, U: Encoder](
    timeoutConf: GroupStateTimeout)(
    func: (K, Iterator[V], GroupState[S]) => U): Dataset[U]

  Creates a new Dataset with FlatMapGroupsWithState logical operator

  Note The difference between mapGroupsWithState and flatMapGroupsWithState is the state function that generates exactly one element (that is in turn the row in the result Dataset).

mapValues
  mapValues[W : Encoder](func: V => W): KeyValueGroupedDataset[K, W]

reduceGroups
  reduceGroups(f: (V, V) => V): Dataset[(K, V)]

Creating KeyValueGroupedDataset Instance


KeyValueGroupedDataset takes the following when created:

Encoder for keys

Encoder for values

QueryExecution

Data attributes

Grouping attributes


mapGroupsWithState Operator — Stateful
Streaming Aggregation (with Explicit State
Logic)
mapGroupsWithState[S: Encoder, U: Encoder](
func: (K, Iterator[V], GroupState[S]) => U): Dataset[U] (1)
mapGroupsWithState[S: Encoder, U: Encoder](
timeoutConf: GroupStateTimeout)(
func: (K, Iterator[V], GroupState[S]) => U): Dataset[U]

1. Uses GroupStateTimeout.NoTimeout for timeoutConf

mapGroupsWithState operator…​FIXME

Note mapGroupsWithState is a special case of the flatMapGroupsWithState operator with the following:

  func being transformed to return a single-element Iterator

  Update output mode

  mapGroupsWithState also creates a FlatMapGroupsWithState with the isMapGroupsWithState internal flag enabled.

// numGroups defined at the beginning


scala> :type numGroups
org.apache.spark.sql.KeyValueGroupedDataset[Long,(java.sql.Timestamp, Long)]

import org.apache.spark.sql.streaming.GroupState
def mappingFunc(key: Long, values: Iterator[(java.sql.Timestamp, Long)], state: GroupS
tate[Long]): Long = {
println(s">>> key: $key => state: $state")
val newState = state.getOption.map(_ + values.size).getOrElse(0L)
state.update(newState)
key
}

import org.apache.spark.sql.streaming.GroupStateTimeout
val longs = numGroups.mapGroupsWithState(
timeoutConf = GroupStateTimeout.ProcessingTimeTimeout)(
func = mappingFunc)

import org.apache.spark.sql.streaming.{OutputMode, Trigger}


import scala.concurrent.duration._
val q = longs.
writeStream.
format("console").
trigger(Trigger.ProcessingTime(10.seconds)).
outputMode(OutputMode.Update). // <-- required for mapGroupsWithState
start

// Note GroupState

-------------------------------------------
Batch: 1
-------------------------------------------
>>> key: 0 => state: GroupState(<undefined>)
>>> key: 1 => state: GroupState(<undefined>)
+-----+
|value|
+-----+
| 0|
| 1|
+-----+

-------------------------------------------
Batch: 2
-------------------------------------------
>>> key: 0 => state: GroupState(0)
>>> key: 1 => state: GroupState(0)
+-----+
|value|
+-----+
| 0|
| 1|
+-----+

-------------------------------------------
Batch: 3
-------------------------------------------
>>> key: 0 => state: GroupState(4)
>>> key: 1 => state: GroupState(4)
+-----+
|value|
+-----+
| 0|
| 1|
+-----+

// in the end
spark.streams.active.foreach(_.stop)


flatMapGroupsWithState Operator — Arbitrary
Stateful Streaming Aggregation (with Explicit
State Logic)
KeyValueGroupedDataset[K, V].flatMapGroupsWithState[S: Encoder, U: Encoder](
outputMode: OutputMode,
timeoutConf: GroupStateTimeout)(
func: (K, Iterator[V], GroupState[S]) => Iterator[U]): Dataset[U]

flatMapGroupsWithState operator is used for Arbitrary Stateful Streaming Aggregation (with

Explicit State Logic).

flatMapGroupsWithState requires that the given OutputMode is either Append or Update

(and reports an IllegalArgumentException at runtime).

Note An OutputMode is a required argument, but does not seem to be used at all. Check out the question What’s the purpose of OutputMode in flatMapGroupsWithState? How/where is it used? on StackOverflow.

Every time the state function func is executed for a key, the state (as GroupState[S] ) is for
this key only.

Note
  K is the type of the keys in KeyValueGroupedDataset
  V is the type of the values (per key) in KeyValueGroupedDataset
  S is the user-defined type of the state as maintained for each group
  U is the type of rows in the result Dataset

Internally, flatMapGroupsWithState creates a new Dataset with FlatMapGroupsWithState


unary logical operator.
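The following is a minimal sketch of flatMapGroupsWithState: a per-group running count over the rate source. The grouping by value % 10 simply fakes ten "devices" for illustration (spark-shell's auto-imported spark.implicits._ provides the encoders).

import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

// a hypothetical per-group running count, kept as GroupState[Long]
def countEvents(
    group: Long,
    events: Iterator[(java.sql.Timestamp, Long)],
    state: GroupState[Long]): Iterator[(Long, Long)] = {
  val newCount = state.getOption.getOrElse(0L) + events.size
  state.update(newCount)
  Iterator((group, newCount)) // zero or more output rows are allowed
}

val events = spark
  .readStream
  .format("rate")
  .load
  .as[(java.sql.Timestamp, Long)]

val counts = events
  .groupByKey { case (_, value) => value % 10 } // <-- ten artificial groups
  .flatMapGroupsWithState(
    OutputMode.Update,
    GroupStateTimeout.NoTimeout)(countEvents)

val sq = counts
  .writeStream
  .format("console")
  .outputMode(OutputMode.Update)
  .start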


StreamingQueryManager — Streaming Query
Management
StreamingQueryManager is the management interface for active streaming queries of a

SparkSession.

Table 1. StreamingQueryManager API


Method Description

active: Array[StreamingQuery]
active
Active structured queries

addListener(listener: StreamingQueryListener): Unit


addListener
Registers (adds) a StreamingQueryListener

awaitAnyTermination(): Unit
awaitAnyTermination(timeoutMs: Long): Boolean
awaitAnyTermination
Waits until any streaming query terminates or timeoutMs elapses

get(id: String): StreamingQuery


get(id: UUID): StreamingQuery
get

Gets the StreamingQuery by id

removeListener(
listener: StreamingQueryListener): Unit
removeListener

De-registers (removes) the StreamingQueryListener

resetTerminated(): Unit

resetTerminated
Resets the internal registry of the terminated streaming
queries (that lets awaitAnyTermination to be used again)

StreamingQueryManager is available using SparkSession.streams property.


scala> :type spark


org.apache.spark.sql.SparkSession

scala> :type spark.streams


org.apache.spark.sql.streaming.StreamingQueryManager

StreamingQueryManager is created when SessionState is created.

Figure 1. StreamingQueryManager
Tip Read up on SessionState in The Internals of Spark SQL gitbook.

StreamingQueryManager is used (internally) to create a StreamingQuery (and its

StreamExecution).

Figure 2. StreamingQueryManager Creates StreamingQuery (and StreamExecution)


StreamingQueryManager is notified about state changes of a structured query and passes

them along (to registered listeners).

StreamingQueryManager takes a single SparkSession when created.

StreamingQueryListenerBus —  listenerBus Internal Property

listenerBus: StreamingQueryListenerBus

listenerBus is a StreamingQueryListenerBus (for the current SparkSession) that is created

immediately when StreamingQueryManager is created.

listenerBus is used for the following:

Register or de-register a given StreamingQueryListener

Post a streaming event (and notify registered StreamingQueryListeners about the


event)

Getting All Active Streaming Queries —  active Method

active: Array[StreamingQuery]

active gets all active streaming queries.

Getting Active Continuous Query By Name —  get Method

get(name: String): StreamingQuery

get method returns a StreamingQuery by name .

It may throw an IllegalArgumentException when no StreamingQuery exists for the name .

java.lang.IllegalArgumentException: There is no active query with name hello


at org.apache.spark.sql.StreamingQueryManager$$anonfun$get$1.apply(StreamingQueryMan
ager.scala:59)
at org.apache.spark.sql.StreamingQueryManager$$anonfun$get$1.apply(StreamingQueryMan
ager.scala:59)
at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
at scala.collection.AbstractMap.getOrElse(Map.scala:59)
at org.apache.spark.sql.StreamingQueryManager.get(StreamingQueryManager.scala:58)
... 49 elided

Registering StreamingQueryListener —  addListener Method


addListener(listener: StreamingQueryListener): Unit

addListener requests the StreamingQueryListenerBus to add the input listener .

De-Registering StreamingQueryListener 
—  removeListener Method

removeListener(listener: StreamingQueryListener): Unit

removeListener requests StreamingQueryListenerBus to remove the input listener .
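A short sketch of registering (and later de-registering) a listener that simply prints out the lifecycle events:

import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

val printingListener = new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit =
    println(s"Query started: ${event.id}")
  override def onQueryProgress(event: QueryProgressEvent): Unit =
    println(s"Query made progress: ${event.progress.numInputRows} input rows")
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit =
    println(s"Query terminated: ${event.id}")
}

spark.streams.addListener(printingListener)

// ...and when no longer needed
spark.streams.removeListener(printingListener)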

Waiting for Any Streaming Query Termination —  awaitAnyTermination Method

awaitAnyTermination(): Unit
awaitAnyTermination(timeoutMs: Long): Boolean

awaitAnyTermination acquires a lock on awaitTerminationLock and waits until any streaming

query has finished (i.e. lastTerminatedQuery is available) or timeoutMs has expired.

awaitAnyTermination re-throws the StreamingQueryException from lastTerminatedQuery if it

reported one.

resetTerminated Method

resetTerminated(): Unit

resetTerminated forgets about the past-terminated query (so that awaitAnyTermination can

be used again to wait for a new streaming query termination).

Internally, resetTerminated acquires a lock on awaitTerminationLock and simply resets


lastTerminatedQuery (i.e. sets it to null ).

Creating Streaming Query —  createQuery Internal


Method


createQuery(
userSpecifiedName: Option[String],
userSpecifiedCheckpointLocation: Option[String],
df: DataFrame,
extraOptions: Map[String, String],
sink: BaseStreamingSink,
outputMode: OutputMode,
useTempCheckpointLocation: Boolean,
recoverFromCheckpointLocation: Boolean,
trigger: Trigger,
triggerClock: Clock): StreamingQueryWrapper

createQuery creates a StreamingQueryWrapper (for a StreamExecution per the input user-

defined properties).

Internally, createQuery first finds the name of the checkpoint directory of a query (aka
checkpoint location) in the following order:

1. Exactly the input userSpecifiedCheckpointLocation if defined

2. spark.sql.streaming.checkpointLocation Spark property if defined for the parent


directory with a subdirectory per the optional userSpecifiedName (or a randomly-
generated UUID)

3. (only when useTempCheckpointLocation is enabled) A temporary directory (as specified


by java.io.tmpdir JVM property) with a subdirectory with temporary prefix.

Note userSpecifiedCheckpointLocation can be any path that is acceptable by Hadoop’s Path.

If the directory name for the checkpoint location could not be found, createQuery reports an AnalysisException.

checkpointLocation must be specified either through option("checkpointLocation", ...)


or SparkSession.conf.set("spark.sql.streaming.checkpointLocation", ...)

createQuery reports an AnalysisException when the input recoverFromCheckpointLocation flag is turned off but there is the offsets directory in the checkpoint location.

createQuery makes sure that the logical plan of the structured query is analyzed (i.e. no

logical errors have been found).

Unless spark.sql.streaming.unsupportedOperationCheck Spark property is turned on,


createQuery checks the logical plan of the streaming query for unsupported operations.


(only when spark.sql.adaptive.enabled Spark property is turned on) createQuery prints out
a WARN message to the logs:

WARN spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and


will be disabled.

In the end, createQuery creates a StreamingQueryWrapper with a new


MicroBatchExecution.

Note recoverFromCheckpointLocation flag corresponds to the recoverFromCheckpointLocation flag that StreamingQueryManager uses to start a streaming query and which is enabled by default (and is in fact the only place where createQuery is used).

  memory sink has the flag enabled for Complete output mode only

  foreach sink has the flag always enabled

  console sink has the flag always disabled

  all other sinks have the flag always enabled

Note userSpecifiedName corresponds to the queryName option (that can be defined using DataStreamWriter's queryName method) while userSpecifiedCheckpointLocation is the checkpointLocation option.

Note createQuery is used exclusively when StreamingQueryManager is requested to start a streaming query (when DataStreamWriter is requested to start an execution of a streaming query).

Starting Streaming Query Execution —  startQuery Internal Method

startQuery(
userSpecifiedName: Option[String],
userSpecifiedCheckpointLocation: Option[String],
df: DataFrame,
extraOptions: Map[String, String],
sink: BaseStreamingSink,
outputMode: OutputMode,
useTempCheckpointLocation: Boolean = false,
recoverFromCheckpointLocation: Boolean = true,
trigger: Trigger = ProcessingTime(0),
triggerClock: Clock = new SystemClock()): StreamingQuery

startQuery starts a streaming query and returns a handle to it.


Note trigger defaults to 0 milliseconds (as ProcessingTime(0)).

Internally, startQuery first creates a StreamingQueryWrapper, registers it in activeQueries


internal registry (by the id), requests it for the underlying StreamExecution and starts it.

In the end, startQuery returns the StreamingQueryWrapper (as part of the fluent API so
you can chain operators) or throws the exception that was reported when attempting to start
the query.

startQuery throws an IllegalArgumentException when there is another query registered

under name . startQuery looks it up in the activeQueries internal registry.

Cannot start query with name [name] as a query with that name is already active

startQuery throws an IllegalStateException when a query is started again from

checkpoint. startQuery looks it up in activeQueries internal registry.

Cannot start query with id [id] as another query with same id is


already active. Perhaps you are attempting to restart a query
from checkpoint that is already active.

Note startQuery is used exclusively when DataStreamWriter is requested to start an execution of the streaming query.

Posting StreamingQueryListener Event to StreamingQueryListenerBus —  postListenerEvent Internal Method

postListenerEvent(event: StreamingQueryListener.Event): Unit

postListenerEvent simply posts the input event to the internal event bus for streaming

events (StreamingQueryListenerBus).


Figure 3. StreamingQueryManager Propagates StreamingQueryListener Events


Note postListenerEvent is used exclusively when StreamExecution is requested to post a streaming event.

Handling Termination of Streaming Query (and Deactivating Query in StateStoreCoordinator) —  notifyQueryTermination Internal Method

notifyQueryTermination(terminatedQuery: StreamingQuery): Unit

notifyQueryTermination removes the terminatedQuery from activeQueries internal registry

(by the query id).

notifyQueryTermination records the terminatedQuery in lastTerminatedQuery internal

registry (when no earlier streaming query was recorded or the terminatedQuery terminated
due to an exception).

notifyQueryTermination notifies others that are blocked on awaitTerminationLock.

In the end, notifyQueryTermination requests StateStoreCoordinator to deactivate all active


runs of the streaming query.


Figure 4. StreamingQueryManager’s Marking Streaming Query as Terminated


Note notifyQueryTermination is used exclusively when StreamExecution is requested to run a streaming query and the query has finished (running streaming batches) (with or without an exception).

Internal Properties


Name / Description

activeQueries
  Registry of StreamingQueries per UUID
  Used when StreamingQueryManager is requested for active streaming queries, to get a streaming query by id, to start a streaming query, and when it is notified that a streaming query has terminated.

activeQueriesLock

awaitTerminationLock

lastTerminatedQuery
  StreamingQuery that has recently been terminated, i.e. stopped or due to an exception.
  null when no streaming query has terminated yet or after resetTerminated.
  Used in awaitAnyTermination to know when a streaming query has terminated
  Set when StreamingQueryManager is notified that a streaming query has terminated

stateStoreCoordinator
  StateStoreCoordinatorRef to the StateStoreCoordinator RPC Endpoint
  Created when StreamingQueryManager is created
  Used when:
    StreamingQueryManager is notified that a streaming query has terminated
    Stateful operators are executed, i.e. FlatMapGroupsWithStateExec, StateStoreRestoreExec, StateStoreSaveExec, StreamingDeduplicateExec and StreamingSymmetricHashJoinExec
    Creating StateStoreRDD (with storeUpdateFunction aborting StateStore when a task fails)


SQLConf — Internal Configuration Store


SQLConf is an internal key-value configuration store for parameters and hints used to

configure a Spark Structured Streaming application (and Spark SQL applications in general).

The parameters and hints are accessible as property accessor methods.

SQLConf is available as the conf property of the SessionState of a SparkSession .

scala> :type spark


org.apache.spark.sql.SparkSession

scala> :type spark.sessionState.conf


org.apache.spark.sql.internal.SQLConf
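As a quick sketch, a few of the streaming-related values can be peeked at in spark-shell through their accessor methods (the accessors below come from the table that follows):

val conf = spark.sessionState.conf

println(conf.minBatchesToRetain)         // spark.sql.streaming.minBatchesToRetain
println(conf.streamingMetricsEnabled)    // spark.sql.streaming.metricsEnabled
println(conf.streamingProgressRetention) // spark.sql.streaming.numRecentProgressUpdates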

Table 1. SQLConf’s Property Accessor Methods


Method Name / Property Description
Used when:
DataSourceV2ScanExec
physical operator is requested
continuousStreamingExecutorQueueSize
the input RDDs (and creates a
spark.sql.streaming.continuous.executorQueueSize ContinuousDataSourceRDD
ContinuousCoalesceExec
physical operator is requested
execute

Used exclusively when


continuousStreamingExecutorPollIntervalMs DataSourceV2ScanExec
operator is requested for the input
spark.sql.streaming.continuous.executorPollIntervalMs RDDs (and creates a
ContinuousDataSourceRDD

Used exclusively when


disabledV2StreamingMicroBatchReaders
MicroBatchExecution is requested
spark.sql.streaming.disabledV2MicroBatchReaders the analyzed logical plan
streaming query)

fileSourceLogDeletion Used exclusively when


FileStreamSourceLog is requested
spark.sql.streaming.fileSource.log.deletion the isDeletingExpiredLog

fileSourceLogCleanupDelay Used exclusively when


FileStreamSourceLog is requested
spark.sql.streaming.fileSource.log.cleanupDelay the fileCleanupDelayMs


fileSourceLogCompactInterval Used exclusively when


FileStreamSourceLog is requested
spark.sql.streaming.fileSource.log.compactInterval the default compaction interval

Used when:
FlatMapGroupsWithStateStra
execution planning strategy is
requested to plan a streaming
FLATMAPGROUPSWITHSTATE_STATE_FORMAT_VERSION query (and creates a
FlatMapGroupsWithStateExec
spark.sql.streaming.flatMapGroupsWithState.stateFormatVersion physical operator for every
FlatMapGroupsWithState
operator)
Among the checkpointed
properties

Used when:
CompactibleFileStreamLog
minBatchesToRetain
created
spark.sql.streaming.minBatchesToRetain
StreamExecution

StateStoreConf is

SHUFFLE_PARTITIONS
See spark.sql.shuffle.partitions
spark.sql.shuffle.partitions Internals of Spark SQL.

Used (as
stateStoreMinDeltasForSnapshot StateStoreConf.minDeltasForSnap
exclusively when
spark.sql.streaming.stateStore.minDeltasForSnapshot HDFSBackedStateStoreProvider
requested to doSnapshot

Used when:

StateStoreWriter
stateStoreProviderClass
stateStoreCustomMetrics
spark.sql.streaming.stateStore.providerClass StateStoreWriter
the metrics and getProgress

StateStoreConf is

Used when:

StatefulAggregationStrategy
execution planning strategy is
STREAMING_AGGREGATION_STATE_FORMAT_VERSION executed

spark.sql.streaming.aggregation.stateFormatVersion


OffsetSeqMetadata
for the relevantSQLConfs
relevantSQLConfDefaultValue

Used exclusively when


STREAMING_CHECKPOINT_FILE_MANAGER_CLASS
CheckpointFileManager
spark.sql.streaming.checkpointFileManagerClass requested to create a
CheckpointFileManager

Used exclusively when


streamingMetricsEnabled StreamExecution is requested for
runStream (to control whether to
spark.sql.streaming.metricsEnabled register a metrics reporter
streaming query)

STREAMING_MULTIPLE_WATERMARK_POLICY

spark.sql.streaming.multipleWatermarkPolicy

Used exclusively when


streamingNoDataMicroBatchesEnabled
MicroBatchExecution stream exec
spark.sql.streaming.noDataMicroBatches.enabled engine is requested to
streaming query

streamingNoDataProgressEventInterval
Used exclusively for ProgressRep
spark.sql.streaming.noDataProgressEventInterval

streamingPollingDelay
Used exclusively when
spark.sql.streaming.pollingDelay StreamExecution is created

Used exclusively when


streamingProgressRetention
ProgressReporter is requested to
spark.sql.streaming.numRecentProgressUpdates update progress of streaming quer
(and possibly remove an excess)


Configuration Properties
Configuration properties are used to fine-tune Spark Structured Streaming applications.

You can set them for a SparkSession when it is created using config method.

import org.apache.spark.sql.SparkSession
val spark = SparkSession
.builder
.config("spark.sql.streaming.metricsEnabled", true)
.getOrCreate

Tip Read up on SparkSession in The Internals of Spark SQL book.
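Properties can also be set (and read back) on an already-created session through the spark.conf runtime configuration, e.g.:

// Set a property for the current SparkSession
spark.conf.set("spark.sql.streaming.metricsEnabled", true)

// Read it back (with a default when unset)
println(spark.conf.get("spark.sql.streaming.metricsEnabled", "false"))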

Table 1. Structured Streaming’s Properties


Name Description
(internal) Version of the state forma
Default: 2
Supported values:
1 (for the legacy
StreamingAggregationStateMa
2 (for the default
StreamingAggregationStateMa
spark.sql.streaming.aggregation.stateFormatVersion
Used when StatefulAggregationStra
planning strategy is executed (and
streaming query with an aggregate
boils down to creating a
with the proper implementation vers
StreamingAggregationStateManage
Among the checkpointed properties
supposed to be overriden after a str
has once been started (and could la
from a checkpoint after being restar

(internal) CheckpointFileManager
checkpoint files atomically
spark.sql.streaming.checkpointFileManagerClass Default: FileContextBasedCheckpo
(with FileSystemBasedCheckpointF
case of unsupported file system use
metadata files)


Default checkpoint directory for stor


spark.sql.streaming.checkpointLocation
data
Default: (empty)

(internal) The size (measured in nu


of the queue used in continuous ex
spark.sql.streaming.continuous.executorQueueSize
the results of a ContinuousDataRea
Default: 1024

(internal) The interval (in millis) at w


continuous execution readers will p
spark.sql.streaming.continuous.executorPollIntervalMs whether the epoch has advanced o

Default: 100 (ms)

(internal) A comma-separated list o


class names of data source provide
MicroBatchReadSupport
these sources will fall back to the V
spark.sql.streaming.disabledV2MicroBatchReaders
Default: (empty)
Use
SQLConf.disabledV2StreamingMicr
to get the current value.

(internal) How long (in millis) a file


to be visible for all readers.
spark.sql.streaming.fileSource.log.cleanupDelay Default: 10 (minutes)
Use SQLConf.fileSourceLogCleanu
the current value.

(internal) Number of log files after w


previous files are compacted into th

Default: 10
spark.sql.streaming.fileSource.log.compactInterval
Must be a positive value (greater th

Use SQLConf.fileSourceLogCompa
the current value.

(internal) Whether to delete the exp


file stream source
spark.sql.streaming.fileSource.log.deletion Default: true
Use SQLConf.fileSourceLogDeletio
current value.

286
Configuration Properties

(internal) State format version used


StateManager for FlatMapGroupsW
physical operator
Default: 2
Supported values:
spark.sql.streaming.flatMapGroupsWithState.stateFormatVersion
1

Among the checkpointed properties


supposed to be overriden after a str
has once been started (and could la
from a checkpoint after being restar

(internal) The maximum number of


will be retained in memory to avoid
files.
Default: 2

Maximum count of versions a State


implementation should retain in me
The value adjusts a trade-off betwe
usage vs cache miss:
spark.sql.streaming.maxBatchesToRetainInMemory
2 covers both success and di
cases
1 covers only success case

0 or negative value disables c


maximize memory size of exec
Used exclusively when
HDFSBackedStateStoreProvider
initialize.

Flag whether Dropwizard CodaHale


reported for active streaming querie
spark.sql.streaming.metricsEnabled Default: false

Use SQLConf.streamingMetricsEna
current value

(internal) The minimum number of


for failure recovery
spark.sql.streaming.minBatchesToRetain Default: 100
Use SQLConf.minBatchesToRetain
current value

287
Configuration Properties

Global watermark policy


calculate the global watermark valu
are multiple watermark operators in
query
Default: min
Supported values:
spark.sql.streaming.multipleWatermarkPolicy
min - chooses the minimum w
reported across multiple opera
max - chooses the maximum a
operators
Cannot be changed between query
the same checkpoint location.

Flag to control whether the


engine should execute batches with
process for eager state manageme
streaming queries ( true
spark.sql.streaming.noDataMicroBatches.enabled
Default: true

Use
SQLConf.streamingNoDataMicroBa
to get the current value

(internal) How long to wait between


events when there is no data (in mi
ProgressReporter is requested to

spark.sql.streaming.noDataProgressEventInterval Default: 10000L

Use
SQLConf.streamingNoDataProgres
to get the current value

Number of StreamingQueryProgres
progressBuffer internal registry whe
ProgressReporter is requested to
of streaming query
spark.sql.streaming.numRecentProgressUpdates
Default: 100
Use SQLConf.streamingProgressR
the current value

(internal) How long (in millis) to del


StreamExecution before
spark.sql.streaming.pollingDelay
no data was available in a batch
Default: 10 (milliseconds)

288
Configuration Properties

The initial delay and how often to ex


spark.sql.streaming.stateStore.maintenanceInterval StateStore’s maintenance task
Default: 60s

(internal) Minimum number of state


files that need to be generated befo
HDFSBackedStateStore will consid
snapshot (consolidate the deltas int
spark.sql.streaming.stateStore.minDeltasForSnapshot
Default: 10
Use SQLConf.stateStoreMinDeltasF
get the current value.

(internal) The fully-qualified class n


StateStoreProvider implementation
state data in stateful streaming que
must have a zero-arg constructor.
spark.sql.streaming.stateStore.providerClass
Default: HDFSBackedStateStorePr

Use SQLConf.stateStoreProviderCl
current value.

(internal) When enabled (


StreamingQueryManager
spark.sql.streaming.unsupportedOperationCheck plan of a streaming query uses sup
operations only.
Default: true
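
You can check the current value of a property at runtime through the SparkSession's RuntimeConfig (spark.conf). A minimal sketch; the property names come from the table above, and the fallback values below simply repeat the documented defaults:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate()

// returns None for spark.sql.streaming.checkpointLocation unless it has been set
println(spark.conf.getOption("spark.sql.streaming.checkpointLocation"))

// get with an explicit fallback value
println(spark.conf.get("spark.sql.streaming.metricsEnabled", "false"))
println(spark.conf.get("spark.sql.streaming.numRecentProgressUpdates", "100"))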


StreamingQueryListener — Intercepting Life
Cycle Events of Streaming Queries
StreamingQueryListener is the contract of listeners that want to be notified about the life

cycle events of streaming queries, i.e. start, progress and termination.

Table 1. StreamingQueryListener Contract

onQueryStarted(event: QueryStartedEvent): Unit

Informs that DataStreamWriter was requested to start execution of the streaming query (on the stream execution thread)

onQueryProgress(event: QueryProgressEvent): Unit

Informs that MicroBatchExecution has finished the triggerExecution phase (the end of a streaming batch)

onQueryTerminated(event: QueryTerminatedEvent): Unit

Informs that a streaming query was stopped or terminated due to an error

StreamingQueryListener is informed about the life cycle events when

StreamingQueryListenerBus is requested to doPostEvent.


Table 2. StreamingQueryListener's Life Cycle Events and Callbacks

QueryStartedEvent (with the id, runId and name of the streaming query), handled by onQueryStarted

Posted when StreamExecution is requested to run stream processing (when DataStreamWriter is requested to start execution of the streaming query on the stream execution thread)

QueryProgressEvent (with a StreamingQueryProgress), handled by onQueryProgress

Posted when ProgressReporter is requested to update progress of a streaming query (after MicroBatchExecution has finished the triggerExecution phase at the end of a streaming batch)

QueryTerminatedEvent (with the id, runId and the exception if terminated due to an error), handled by onQueryTerminated

Posted when StreamExecution is requested to run stream processing (and the streaming query was stopped or terminated due to an error)

You can register a StreamingQueryListener using the StreamingQueryManager.addListener method.

val queryListener: StreamingQueryListener = ...


spark.streams.addListener(queryListener)

You can remove a StreamingQueryListener using the StreamingQueryManager.removeListener method.

val queryListener: StreamingQueryListener = ...


spark.streams.removeListener(queryListener)
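
A minimal sketch of a listener implementation that simply prints the life cycle events to standard output (the value names are arbitrary; a real listener would typically forward the events to a monitoring system):

import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

val queryListener = new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit =
    println(s"Query started: id=${event.id} runId=${event.runId} name=${event.name}")

  override def onQueryProgress(event: QueryProgressEvent): Unit =
    println(s"Query made progress: ${event.progress.prettyJson}")

  override def onQueryTerminated(event: QueryTerminatedEvent): Unit =
    println(s"Query terminated: id=${event.id} exception=${event.exception}")
}

spark.streams.addListener(queryListener)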


Figure 1. StreamingQueryListener Notified about Query’s Start (onQueryStarted)


onQueryStarted is used internally to unblock the starting thread of
Note
StreamExecution .

Figure 2. StreamingQueryListener Notified about Query’s Progress (onQueryProgress)

Figure 3. StreamingQueryListener Notified about Query’s Termination (onQueryTerminated)


You can also register a streaming event listener using the general
SparkListener interface.
Note
Read up on SparkListener in The Internals of Apache Spark book.


ProgressReporter Contract
ProgressReporter is the contract of stream execution progress reporters that report the

statistics of execution of a streaming query.

Table 1. ProgressReporter Contract

currentBatchId: Long

Id of the current streaming micro-batch

id: UUID

Universally unique identifier (UUID) of the streaming query (that stays unchanged between restarts)

lastExecution: QueryExecution

QueryExecution of the streaming query

logicalPlan: LogicalPlan

Logical query plan of the streaming query

Used when ProgressReporter is requested for the following:

Extract statistics from the most recent query execution (to add the watermark metric when a streaming watermark is used)

extractSourceToNumInputRows

name: String

Name of the streaming query

newData: Map[BaseStreamingSource, LogicalPlan]

Streaming readers and sources with the new data (as a LogicalPlan)

Used when ProgressReporter extracts statistics from the most recent query execution (to calculate the so-called inputRows)

offsetSeqMetadata: OffsetSeqMetadata

OffsetSeqMetadata (with the current micro-batch event-time watermark and timestamp)

postEvent(event: StreamingQueryListener.Event): Unit

Posts a StreamingQueryListener.Event

runId: UUID

Universally unique identifier (UUID) of the single run of the streaming query (that changes every restart)

sink: BaseStreamingSink

The one and only streaming writer or sink of the streaming query

sources: Seq[BaseStreamingSource]

Streaming readers and sources of the streaming query

Used when finishing a trigger (and updating progress and marking the current status as trigger inactive)

sparkSession: SparkSession

SparkSession of the streaming query (read up on SparkSession in The Internals of Spark SQL book)

triggerClock: Clock

Clock of the streaming query


StreamExecution is the one and only known direct extension of the


Note
ProgressReporter Contract in Spark Structured Streaming.

ProgressReporter uses the spark.sql.streaming.noDataProgressEventInterval configuration property to control how long to wait between two progress events when there is no data (default: 10000L) when finishing a trigger.

ProgressReporter uses yyyy-MM-dd'T'HH:mm:ss.SSS'Z' time format (with UTC timezone).


import org.apache.spark.sql.streaming.Trigger
import scala.concurrent.duration._
val sampleQuery = spark
.readStream
.format("rate")
.load
.writeStream
.format("console")
.option("truncate", false)
.trigger(Trigger.ProcessingTime(10.seconds))
.start

// Using public API


import org.apache.spark.sql.streaming.SourceProgress
scala> sampleQuery.
| lastProgress.
| sources.
| map { case sp: SourceProgress =>
| s"source = ${sp.description} => endOffset = ${sp.endOffset}" }.
| foreach(println)
source = RateSource[rowsPerSecond=1, rampUpTimeSeconds=0, numPartitions=8] => endOffse
t = 663

scala> println(sampleQuery.lastProgress.sources(0))
res40: org.apache.spark.sql.streaming.SourceProgress =
{
"description" : "RateSource[rowsPerSecond=1, rampUpTimeSeconds=0, numPartitions=8]",
"startOffset" : 333,
"endOffset" : 343,
"numInputRows" : 10,
"inputRowsPerSecond" : 0.9998000399920015,
"processedRowsPerSecond" : 200.0
}

// With a hack
import org.apache.spark.sql.execution.streaming.StreamingQueryWrapper
val offsets = sampleQuery.
asInstanceOf[StreamingQueryWrapper].
streamingQuery.
availableOffsets.
map { case (source, offset) =>
s"source = $source => offset = $offset" }
scala> offsets.foreach(println)
source = RateSource[rowsPerSecond=1, rampUpTimeSeconds=0, numPartitions=8] => offset =
293


Configure logging of the concrete stream execution progress reporters to see


what happens inside a ProgressReporter :
Tip ContinuousExecution
MicroBatchExecution

progressBuffer Internal Property

progressBuffer: Queue[StreamingQueryProgress]

progressBuffer is a scala.collection.mutable.Queue of StreamingQueryProgresses.

progressBuffer has a new StreamingQueryProgress added when ProgressReporter is

requested to update progress of a streaming query.

When the size (the number of StreamingQueryProgresses ) is above


spark.sql.streaming.numRecentProgressUpdates threshold, the oldest
StreamingQueryProgress is removed (dequeued).

progressBuffer is used when ProgressReporter is requested for the last and the recent StreamingQueryProgresses.
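
For intuition, the retention behaviour described above can be sketched outside Spark as follows (a standalone sketch, not Spark's actual code; String stands in for StreamingQueryProgress and the threshold is the default of spark.sql.streaming.numRecentProgressUpdates):

import scala.collection.mutable

val numRecentProgressUpdates = 100
val progressBuffer = mutable.Queue[String]()

def updateProgress(newProgress: String): Unit = progressBuffer.synchronized {
  // enqueue the new progress update
  progressBuffer += newProgress
  // drop the oldest updates once the retention threshold is reached
  while (progressBuffer.length >= numRecentProgressUpdates) {
    progressBuffer.dequeue()
  }
}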

status Method

status: StreamingQueryStatus

status gives the current StreamingQueryStatus.

status is used when StreamingQueryWrapper is requested for the current


Note
status of a streaming query (that is part of StreamingQuery Contract).

Updating Progress of Streaming Query 


—  updateProgress Internal Method

updateProgress(newProgress: StreamingQueryProgress): Unit

updateProgress records the input newProgress and posts a QueryProgressEvent event.


Figure 1. ProgressReporter’s Reporting Query Progress


updateProgress adds the input newProgress to progressBuffer.

updateProgress removes elements from progressBuffer if their number is or exceeds the

value of spark.sql.streaming.numRecentProgressUpdates property.

updateProgress posts a QueryProgressEvent (with the input newProgress ).

updateProgress prints out the following INFO message to the logs:

Streaming query made progress: [newProgress]

updateProgress synchronizes concurrent access to the progressBuffer internal


Note
registry.

updateProgress is used exclusively when ProgressReporter is requested to


Note
finish up a trigger.

Initializing Query Progress for New Trigger 


—  startTrigger Method

startTrigger(): Unit

startTrigger prints out the following DEBUG message to the logs:

Starting Trigger Calculation


Table 2. startTrigger's Internal Registry Changes For New Trigger

lastTriggerStartTimestamp: set to the (previous) currentTriggerStartTimestamp

currentTriggerStartTimestamp: requests the trigger clock for the current timestamp (in millis)

currentStatus: enables (true) the isTriggerActive flag of the currentStatus

currentTriggerStartOffsets: null

currentTriggerEndOffsets: null

currentDurationsMs: clears the currentDurationsMs

startTrigger is used when:

MicroBatchExecution stream execution engine is requested to run an


activated streaming query (at the beginning of every trigger)
Note ContinuousExecution stream execution engine is requested to run an
activated streaming query (at the beginning of every trigger)
StreamExecution starts running batches (as part of TriggerExecutor executing a
batch runner).

Finishing Up Streaming Batch (Trigger) and Generating


StreamingQueryProgress —  finishTrigger Method

finishTrigger(hasNewData: Boolean): Unit

Internally, finishTrigger sets currentTriggerEndTimestamp to the current time (using


triggerClock).

finishTrigger extracts execution statistics (extractExecutionStats).

finishTrigger calculates the processing time (in seconds) as the difference between the

end and start timestamps.

finishTrigger calculates the input time (in seconds) as the difference between the start

time of the current and last triggers.


Figure 2. ProgressReporter’s finishTrigger and Timestamps


finishTrigger prints out the following DEBUG message to the logs:

Execution stats: [executionStats]

finishTrigger creates a SourceProgress (aka source statistics) for every source used.

finishTrigger creates a SinkProgress (aka sink statistics) for the sink.

finishTrigger creates a StreamingQueryProgress.

If there was any data (using the input hasNewData flag), finishTrigger resets
lastNoDataProgressEventTime (i.e. becomes the minimum possible time) and updates
query progress.

Otherwise, when no data was available (using the input hasNewData flag), finishTrigger
updates query progress only when lastNoDataProgressEventTime passed.

In the end, finishTrigger disables isTriggerActive flag of StreamingQueryStatus (i.e.


sets it to false ).

finishTrigger is used exclusively when MicroBatchExecution is requested to


Note run the activated streaming query (after triggerExecution Phase at the end of a
streaming batch).

Time-Tracking Section (Recording Execution Time for


Progress Reporting) —  reportTimeTaken Method

reportTimeTaken[T](
triggerDetailKey: String)(
body: => T): T

reportTimeTaken measures the time to execute body and records it in the

currentDurationsMs internal registry under triggerDetailKey key. If the triggerDetailKey


key was recorded already, the current execution time is added.

In the end, reportTimeTaken prints out the following DEBUG message to the logs and
returns the result of executing body .

[triggerDetailKey] took [time] ms


reportTimeTaken is used when the stream execution engines are requested to


execute the following phases (that appear as triggerDetailKey in the DEBUG
message in the logs):

MicroBatchExecution

triggerExecution
getOffset

setOffsetRange
getEndOffset
Note
walCommit
getBatch
queryPlanning
addBatch
ContinuousExecution

queryPlanning
runContinuous
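
The measure-and-accumulate pattern itself is simple to illustrate outside Spark. A minimal standalone sketch (not Spark's actual code) of the behaviour described above, using println in place of a DEBUG log:

import scala.collection.mutable

val currentDurationsMs = mutable.HashMap[String, Long]()

def reportTimeTaken[T](triggerDetailKey: String)(body: => T): T = {
  val startTime = System.currentTimeMillis()
  val result = body
  val timeTaken = math.max(System.currentTimeMillis() - startTime, 0)
  // add to any time already recorded under the same key
  currentDurationsMs(triggerDetailKey) =
    currentDurationsMs.getOrElse(triggerDetailKey, 0L) + timeTaken
  println(s"$triggerDetailKey took $timeTaken ms")
  result
}

// Usage
reportTimeTaken("queryPlanning") {
  Thread.sleep(5)  // stands in for the actual phase
}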

Updating Status Message —  updateStatusMessage


Method

updateStatusMessage(message: String): Unit

updateStatusMessage simply updates the message in the StreamingQueryStatus internal

registry.

updateStatusMessage is used when:

StreamExecution is requested to run stream processing


Note
MicroBatchExecution is requested to run an activated streaming query,
construct the next streaming micro-batch

Generating Execution Statistics 


—  extractExecutionStats Internal Method

extractExecutionStats(hasNewData: Boolean): ExecutionStats


extractExecutionStats generates an ExecutionStats of the last execution of the streaming

query.

Internally, extractExecutionStats generates the watermark metric (using the event-time watermark of the OffsetSeqMetadata) if there is an EventTimeWatermark unary logical operator in the logical plan of the streaming query.

EventTimeWatermark unary logical operator represents Dataset.withWatermark


Note
operator in a streaming query.

extractExecutionStats then generates the state operator metrics (extractStateOperatorMetrics) and the per-source input row counts (extractSourceToNumInputRows).

extractExecutionStats finds the EventTimeWatermarkExec unary physical operator (with

non-zero EventTimeStats) and generates max, min, and avg statistics.

In the end, extractExecutionStats creates a ExecutionStats with the execution statistics.

If the input hasNewData flag is turned off ( false ), extractExecutionStats returns an


ExecutionStats with no input rows and event-time statistics (that require data to be
processed to have any sense).

extractExecutionStats is used exclusively when ProgressReporter is


Note requested to finish up a streaming batch (trigger) and generate a
StreamingQueryProgress.

Generating StateStoreWriter Metrics


(StateOperatorProgress) 
—  extractStateOperatorMetrics Internal Method

extractStateOperatorMetrics(
hasNewData: Boolean): Seq[StateOperatorProgress]

extractStateOperatorMetrics requests the QueryExecution for the optimized execution plan

( executedPlan ) and finds all StateStoreWriter physical operators and requests them for
StateOperatorProgress.

extractStateOperatorMetrics clears (zeros) the numRowsUpdated metric for the given

hasNewData turned off ( false ).

extractStateOperatorMetrics returns an empty collection for the QueryExecution

uninitialized ( null ).


extractStateOperatorMetrics is used exclusively when ProgressReporter is


Note
requested to generate execution statistics.

extractSourceToNumInputRows Internal Method

extractSourceToNumInputRows(): Map[BaseStreamingSource, Long]

extractSourceToNumInputRows …​FIXME

extractSourceToNumInputRows is used exclusively when ProgressReporter is


Note
requested to generate execution statistics.

formatTimestamp Internal Method

formatTimestamp(millis: Long): String

formatTimestamp …​FIXME

Note formatTimestamp is used when…​FIXME

Recording Trigger Offsets (StreamProgress) 


—  recordTriggerOffsets Method

recordTriggerOffsets(
from: StreamProgress,
to: StreamProgress): Unit

recordTriggerOffsets simply sets (records) the currentTriggerStartOffsets and

currentTriggerEndOffsets internal registries to the json representations of the from and to


StreamProgresses.

recordTriggerOffsets is used when:

MicroBatchExecution is requested to run the activated streaming query


Note
ContinuousExecution is requested to commit an epoch

Last StreamingQueryProgress —  lastProgress Method


lastProgress: StreamingQueryProgress

lastProgress …​FIXME

Note lastProgress is used when…​FIXME

recentProgress Method

recentProgress: Array[StreamingQueryProgress]

recentProgress …​FIXME

Note recentProgress is used when…​FIXME

Internal Properties

currentDurationsMs

scala.collection.mutable.HashMap of action names (aka triggerDetailKey) and their cumulative times (in milliseconds).

Starts empty when ProgressReporter sets the state for a new batch, with new entries added or updated when reporting execution time (of an action).

Tip: You can see the current value of currentDurationsMs in progress reports under durationMs.

scala> query.lastProgress.durationMs
res3: java.util.Map[String,Long] = {triggerExecution=60, queryPlanning=1, getBatch=5, getOffset=0, addBatch=30, walCommit=23}

currentStatus

StreamingQueryStatus with the current status of the streaming query

Available using the status method

The message is updated with updateStatusMessage

currentTriggerEndOffsets

End offsets (in JSON format) per source, set when recording trigger offsets (StreamProgress)

currentTriggerEndTimestamp

Timestamp of when the current batch/trigger has ended

Default: -1L

currentTriggerStartOffsets: Map[BaseStreamingSource, String]

Start offsets (in JSON format) per source

Used exclusively when finishing up a streaming batch (trigger) and generating StreamingQueryProgress (for a SourceProgress)

Reset (null) when initializing a query progress for a new trigger

Initialized when recording trigger offsets (StreamProgress)

currentTriggerStartTimestamp

Timestamp of when the current batch/trigger has started

Default: -1L

lastNoDataProgressEventTime

Default: Long.MinValue

lastTriggerStartTimestamp

Timestamp of when the last batch/trigger started

Default: -1L

metricWarningLogged

Flag to…FIXME

Default: false


StreamingQueryProgress
StreamingQueryProgress holds information about the progress of a streaming query.

StreamingQueryProgress is created exclusively when StreamExecution finishes a trigger.

Use lastProgress property of a StreamingQuery to access the most recent


StreamingQueryProgress update.

Note val sq: StreamingQuery = ...


sq.lastProgress

Use recentProgress property of a StreamingQuery to access the most recent


StreamingQueryProgress updates.

Note val sq: StreamingQuery = ...


sq.recentProgress

Use StreamingQueryListener to get notified about StreamingQueryProgress


Note updates while a streaming query is executed.


Table 1. StreamingQueryProgress’s Properties


Name Description
id Unique identifier of a streaming query

runId Unique identifier of the current execution of a streaming query

name Optional query name

timestamp Time when the trigger has started (in ISO8601 format).

batchId Unique id of the current batch

durationMs Durations of the internal phases (in milliseconds)

eventTime Statistics of event time seen in this batch

stateOperators Information about stateful operators in the query that store state.

sources
Statistics about the data read from every streaming source in a
streaming query

sink Information about progress made for a sink
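
The following shows how these properties can be accessed from the most recent progress update (assuming sq is an active StreamingQuery as in the notes above; lastProgress can be null until the first trigger completes):

val p = sq.lastProgress
println(p.id)         // unique identifier of the streaming query
println(p.batchId)    // unique id of the current batch
println(p.durationMs) // java.util.Map of the internal phase durations
p.sources.foreach(s => println(s"${s.description}: ${s.numInputRows} rows"))
println(p.sink)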


ExecutionStats
ExecutionStats is…​FIXME


SourceProgress
SourceProgress is…​FIXME


SinkProgress
SinkProgress is…​FIXME


StreamingQueryStatus
StreamingQueryStatus is…​FIXME


MetricsReporter
MetricsReporter is…​FIXME


Web UI
Web UI…​FIXME

Caution FIXME What’s visible on the plan diagram in the SQL tab of the UI


Logging
Caution FIXME


FileStreamSource
FileStreamSource is a Source that reads text files from the path directory as they appear. It uses LongOffset offsets.

Note It is used by DataSource.createSource for FileFormat .

You can provide the schema of the data and dataFrameBuilder (the function to build a DataFrame in getBatch) at instantiation time.

// NOTE The source directory must exist


// mkdir text-logs

val df = spark.readStream
.format("text")
.option("maxFilesPerTrigger", 1)
.load("text-logs")

scala> df.printSchema
root
|-- value: string (nullable = true)

Batches are indexed.

It lives in org.apache.spark.sql.execution.streaming package.

import org.apache.spark.sql.types._
val schema = StructType(
StructField("id", LongType, nullable = false) ::
StructField("name", StringType, nullable = false) ::
StructField("score", DoubleType, nullable = false) :: Nil)

// You should have input-json directory available


val in = spark.readStream
.format("json")
.schema(schema)
.load("input-json")

val input = in.transform { ds =>


println("transform executed") // <-- it's going to be executed once only
ds
}

scala> input.isStreaming
res9: Boolean = true


It tracks already-processed files in seenFiles hash map.

Enable DEBUG or TRACE logging level for


org.apache.spark.sql.execution.streaming.FileStreamSource to see what happens
inside.

Add the following line to conf/log4j.properties :


Tip
log4j.logger.org.apache.spark.sql.execution.streaming.FileStreamSource=TRACE

Refer to Logging.

Creating FileStreamSource Instance

Caution FIXME

Options

maxFilesPerTrigger

maxFilesPerTrigger option specifies the maximum number of files per trigger (batch). It

limits the file stream source to read the maxFilesPerTrigger number of files specified at a
time and hence enables rate limiting.

It allows a static set of files to be used like a stream for testing, as the file set is processed maxFilesPerTrigger files at a time (as shown in the sketch below).
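
A minimal sketch of replaying a static directory of JSON files as a stream, two files per micro-batch (the schema and the static-json-dir directory name are assumptions for the example):

import org.apache.spark.sql.types._

val schema = new StructType().add("id", LongType).add("name", StringType)

val sq = spark
  .readStream
  .schema(schema)
  .option("maxFilesPerTrigger", 2)
  .json("static-json-dir")
  .writeStream
  .format("console")
  .start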

schema
If the schema is specified at instantiation time (using optional dataSchema constructor
parameter) it is returned.

Otherwise, fetchAllFiles internal method is called to list all the files in a directory.

When there is at least one file the schema is calculated using dataFrameBuilder constructor
parameter function. Else, an IllegalArgumentException("No schema specified") is thrown
unless it is for text provider (as providerName constructor parameter) where the default
schema with a single value column of type StringType is assumed.

text as the value of providerName constructor parameter denotes text file


Note
stream provider.

getOffset Method


getOffset: Option[Offset]

Note getOffset is part of the Source Contract to find the latest offset.

getOffset …​FIXME

The maximum offset ( getOffset ) is calculated by fetching all the files in path excluding
files that start with _ (underscore).

When computing the maximum offset using getOffset , you should see the following
DEBUG message in the logs:

Listed ${files.size} in ${(endTime.toDouble - startTime) / 1000000}ms

When computing the maximum offset using getOffset , it also filters out the files that were
already seen (tracked in seenFiles internal registry).

You should see the following DEBUG message in the logs (depending on the status of a
file):

new file: $file


// or
old file: $file

Generating DataFrame for Streaming Batch —  getBatch


Method
FileStreamSource.getBatch asks metadataLog for the batch.

You should see the following INFO and DEBUG messages in the logs:

INFO Processing ${files.length} files from ${startId + 1}:$endId


DEBUG Streaming ${files.mkString(", ")}

The method to create a result batch is given at instantiation time (as dataFrameBuilder
constructor parameter).

metadataLog
metadataLog is a metadata storage using metadataPath path (which is a constructor

parameter).


Note It extends HDFSMetadataLog[Seq[String]] .

Caution FIXME Review HDFSMetadataLog

fetchMaxOffset Internal Method

fetchMaxOffset(): FileStreamSourceOffset

fetchMaxOffset …​FIXME

fetchMaxOffset is used exclusively when FileStreamSource is requested to


Note
getOffset.

fetchAllFiles Internal Method

fetchAllFiles(): Seq[(String, Long)]

fetchAllFiles …​FIXME

fetchAllFiles is used exclusively when FileStreamSource is requested to


Note
fetchMaxOffset.

allFilesUsingMetadataLogFileIndex Internal
Method

allFilesUsingMetadataLogFileIndex(): Seq[FileStatus]

allFilesUsingMetadataLogFileIndex simply creates a new MetadataLogFileIndex and

requests it to allFiles .

allFilesUsingMetadataLogFileIndex is used exclusively when FileStreamSource


Note is requested to fetchAllFiles (when requested for fetchMaxOffset when
FileStreamSource is requested to getOffset).


FileStreamSink — Streaming Sink for File-


Based Data Sources
FileStreamSink is a concrete streaming sink that writes out the results of a streaming query

to files (of the specified FileFormat) in the root path.

import scala.concurrent.duration._
import org.apache.spark.sql.streaming.{OutputMode, Trigger}
val sq = in.
writeStream.
format("parquet").
option("path", "parquet-output-dir").
option("checkpointLocation", "checkpoint-dir").
trigger(Trigger.ProcessingTime(10.seconds)).
outputMode(OutputMode.Append).
start

FileStreamSink is created exclusively when DataSource is requested to create a streaming

sink for a file-based data source (i.e. FileFormat ).

Tip Read up on FileFormat in The Internals of Spark SQL book.

FileStreamSink supports Append output mode only.

FileStreamSink uses the spark.sql.streaming.fileSink.log.deletion configuration property (as isDeletingExpiredLog).

The textual representation of FileStreamSink is FileSink[path].

FileStreamSink uses the _spark_metadata directory for…FIXME
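
The output directory of a FileStreamSink can be read back as a regular (batch) DataFrame; Spark then uses the _spark_metadata log to list only the files committed by completed batches (see hasMetadata and MetadataLogFileIndex below). A sketch using the path from the example above:

val results = spark.read.parquet("parquet-output-dir")
results.show(truncate = false)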

Enable ALL logging level for


org.apache.spark.sql.execution.streaming.FileStreamSink to see what happens
inside.
Add the following line to conf/log4j.properties :
Tip
log4j.logger.org.apache.spark.sql.execution.streaming.FileStreamSink=ALL

Refer to Logging.

Creating FileStreamSink Instance


FileStreamSink takes the following to be created:


SparkSession

Root directory

FileFormat

Names of the partition columns

Configuration options

FileStreamSink initializes the internal properties.

"Adding" Batch of Data to Sink —  addBatch Method

addBatch(
batchId: Long,
data: DataFrame): Unit

Note addBatch is a part of Sink Contract to "add" a batch of data to the sink.

addBatch …​FIXME

Creating BasicWriteJobStatsTracker 
—  basicWriteJobStatsTracker Internal Method

basicWriteJobStatsTracker: BasicWriteJobStatsTracker

basicWriteJobStatsTracker simply creates a BasicWriteJobStatsTracker with the basic

metrics:

number of written files

bytes of written output

number of output rows

number of dynamic partitions

Tip Read up on BasicWriteJobStatsTracker in The Internals of Spark SQL book.

basicWriteJobStatsTracker is used exclusively when FileStreamSink is


Note
requested to addBatch.

hasMetadata Object Method


hasMetadata(
path: Seq[String],
hadoopConf: Configuration): Boolean

hasMetadata …​FIXME

hasMetadata is used when:

DataSource (Spark SQL) is requested to resolve a FileFormat relation


Note ( resolveRelation ) and creates a HadoopFsRelation
FileStreamSource is requested to fetchAllFiles

Internal Properties

Name Description
Base path (Hadoop’s Path for the given path)
basePath
Used when…​FIXME

Metadata log path (Hadoop’s Path for the base path and
logPath the _spark_metadata)

Used exclusively to create the FileStreamSinkLog

FileStreamSinkLog (for the version 1 and the metadata log


path)
fileLog
Used exclusively when FileStreamSink is requested to
addBatch

Hadoop’s Configuration
hadoopConf
Used when…​FIXME


FileStreamSinkLog
FileStreamSinkLog is a concrete CompactibleFileStreamLog (of SinkFileStatuses) for

FileStreamSink and MetadataLogFileIndex.

FileStreamSinkLog uses 1 for the version.

FileStreamSinkLog uses add action to create new metadata logs.

FileStreamSinkLog uses delete action to mark metadata logs that should be excluded from

compaction.

Creating FileStreamSinkLog Instance


FileStreamSinkLog (like the parent CompactibleFileStreamLog) takes the following to be

created:

Metadata version

SparkSession

Path of the metadata log directory

compactLogs Method

compactLogs(logs: Seq[SinkFileStatus]): Seq[SinkFileStatus]

Note compactLogs is part of the CompactibleFileStreamLog Contract to…​FIXME.

compactLogs …​FIXME


SinkFileStatus
SinkFileStatus represents the status of files of FileStreamSink (and the type of the

metadata of FileStreamSinkLog):

Path

Size

isDir flag

Modification time

Block replication

Block size

Action (either add or delete)

toFileStatus Method

toFileStatus: FileStatus

toFileStatus simply creates a new Hadoop FileStatus.

Note toFileStatus is used exclusively when MetadataLogFileIndex is created.

Creating SinkFileStatus Instance —  apply Object Method

apply(f: FileStatus): SinkFileStatus

apply simply creates a new SinkFileStatus (with add action).

apply is used exclusively when ManifestFileCommitProtocol is requested to


Note
commitTask.


ManifestFileCommitProtocol
ManifestFileCommitProtocol is…​FIXME

commitJob Method

commitJob(
jobContext: JobContext,
taskCommits: Seq[TaskCommitMessage]): Unit

Note commitJob is part of the FileCommitProtocol contract to…​FIXME.

commitJob …​FIXME

commitTask Method

commitTask(
taskContext: TaskAttemptContext): TaskCommitMessage

Note commitTask is part of the FileCommitProtocol contract to…​FIXME.

commitTask …​FIXME


MetadataLogFileIndex
MetadataLogFileIndex is a PartitioningAwareFileIndex of metadata log files (generated by

FileStreamSink).

MetadataLogFileIndex is created when:

DataSource (Spark SQL) is requested to resolve a FileFormat relation

( resolveRelation ) and creates a HadoopFsRelation

FileStreamSource is requested to allFilesUsingMetadataLogFileIndex

Enable ALL logging level for


org.apache.spark.sql.execution.streaming.MetadataLogFileIndex to see what
happens inside.
Add the following line to conf/log4j.properties :
Tip
log4j.logger.org.apache.spark.sql.execution.streaming.MetadataLogFileIndex=ALL

Refer to Logging.

Creating MetadataLogFileIndex Instance


MetadataLogFileIndex takes the following to be created:

SparkSession

Hadoop’s Path

User-defined schema ( Option[StructType] )

MetadataLogFileIndex initializes the internal properties.

While being created, MetadataLogFileIndex prints out the following INFO message to the
logs:

Reading streaming file log from [metadataDirectory]

Internal Properties


Name Description

Metadata directory (Hadoop’s Path of the _spark_metadata


metadataDirectory directory under the path)
Used when…​FIXME

metadataLog FileStreamSinkLog (with the _spark_metadata directory)

allFilesFromLog Metadata log files


Kafka Data Source — Streaming Data Source


for Apache Kafka
Kafka Data Source is the streaming data source for Apache Kafka in Spark Structured
Streaming.

Kafka Data Source provides a streaming source and a streaming sink for micro-batch and
continuous stream processing.

spark-sql-kafka-0-10 External Module


Kafka Data Source is part of the spark-sql-kafka-0-10 external module that is distributed
with the official distribution of Apache Spark, but it is not included in the CLASSPATH by
default.

You should define spark-sql-kafka-0-10 module as part of the build definition in your Spark
project, e.g. as a libraryDependency in build.sbt for sbt:

libraryDependencies += "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.4.4"

For Spark environments like spark-submit (and "derivatives" like spark-shell ), you should
use --packages command-line option:

./bin/spark-shell --packages org.apache.spark:spark-sql-kafka-0-10_2.12:2.4.4

Replace the version of spark-sql-kafka-0-10 module (e.g. 2.4.4 above) with


Note one of the available versions found at The Central Repository’s Search that
matches your version of Apache Spark.

Streaming Source
With spark-sql-kafka-0-10 module you can use kafka data source format for loading data
(reading records) from one or more Kafka topics as a streaming Dataset.

val records = spark


.readStream
.format("kafka")
.option("subscribePattern", """topic-\d{2}""") // topics with two digits at the end
.option("kafka.bootstrap.servers", ":9092")
.load


Kafka data source supports many options for reading.

Internally, the kafka data source format for reading is available through
KafkaSourceProvider that is a MicroBatchReadSupport and ContinuousReadSupport for
micro-batch and continuous stream processing, respectively.

Predefined (Fixed) Schema


Kafka Data Source uses a predefined (fixed) schema.

Table 1. Kafka Data Source’s Fixed Schema (in the positional order)
Name Type
key BinaryType

value BinaryType

topic StringType

partition IntegerType

offset LongType

timestamp TimestampType

timestampType IntegerType

scala> records.printSchema
root
|-- key: binary (nullable = true)
|-- value: binary (nullable = true)
|-- topic: string (nullable = true)
|-- partition: integer (nullable = true)
|-- offset: long (nullable = true)
|-- timestamp: timestamp (nullable = true)
|-- timestampType: integer (nullable = true)

Internally, the fixed schema is defined as part of the DataSourceReader contract through
MicroBatchReader and ContinuousReader extension contracts for micro-batch and
continuous stream processing, respectively.

Tip Read up on DataSourceReader in The Internals of Spark SQL book.


Use Column.cast operator to cast BinaryType to a StringType (for key and


value columns).

scala> :type records


org.apache.spark.sql.DataFrame
Tip val values = records
.select($"value" cast "string") // deserializing values
scala> values.printSchema
root
|-- value: string (nullable = true)

Streaming Sink
With spark-sql-kafka-0-10 module you can use kafka data source format for writing the
result of executing a streaming query (a streaming Dataset) to one or more Kafka topics.

val sq = records
.writeStream
.format("kafka")
.option("kafka.bootstrap.servers", ":9092")
.option("topic", "kafka2console-output")
.option("checkpointLocation", "checkpointLocation-kafka2console")
.start

Internally, the kafka data source format for writing is available through KafkaSourceProvider
that is a StreamWriteSupport.

Micro-Batch Stream Processing


Kafka Data Source supports Micro-Batch Stream Processing (i.e. Trigger.Once and
Trigger.ProcessingTime triggers) via KafkaMicroBatchReader.


import org.apache.spark.sql.streaming.Trigger
import scala.concurrent.duration._
val sq = spark
.readStream
.format("kafka")
.option("subscribepattern", "kafka2console.*")
.option("kafka.bootstrap.servers", ":9092")
.load
.withColumn("value", $"value" cast "string") // deserializing values
.writeStream
.format("console")
.option("truncate", false) // format-specific option
.option("checkpointLocation", "checkpointLocation-kafka2console") // generic query o
ption
.trigger(Trigger.ProcessingTime(30.seconds))
.queryName("kafka2console-microbatch")
.start

// In the end, stop the streaming query


sq.stop

Kafka Data Source can assign a single task per Kafka partition (using
KafkaOffsetRangeCalculator in Micro-Batch Stream Processing).

Kafka Data Source can reuse a Kafka consumer (using KafkaMicroBatchReader in Micro-
Batch Stream Processing).

Continuous Stream Processing


Kafka Data Source supports Continuous Stream Processing (i.e. Trigger.Continuous trigger)
via KafkaContinuousReader.


import org.apache.spark.sql.streaming.Trigger
import scala.concurrent.duration._
val sq = spark
.readStream
.format("kafka")
.option("subscribepattern", "kafka2console.*")
.option("kafka.bootstrap.servers", ":9092")
.load
.withColumn("value", $"value" cast "string") // convert bytes to string for display
purposes
.writeStream
.format("console")
.option("truncate", false) // format-specific option
.option("checkpointLocation", "checkpointLocation-kafka2console") // generic query o
ption
.queryName("kafka2console-continuous")
.trigger(Trigger.Continuous(10.seconds))
.start

// In the end, stop the streaming query


sq.stop

Configuration Options

Options with kafka. prefix (e.g. kafka.bootstrap.servers) are considered


Note configuration properties for the Kafka consumers used on the driver and
executors.

Table 2. Kafka Data Source's Options (Case-Insensitive)

assign

Topic subscription strategy that accepts a JSON with topic names and partitions, e.g.

{"topicA":[0,1],"topicB":[0,1]}

Note: Exactly one topic subscription strategy is allowed (that KafkaSourceProvider validates before creating KafkaSource).

failOnDataLoss

Flag to control whether…FIXME

Default: true

Used when KafkaSourceProvider is requested for the failOnDataLoss configuration property

kafka.bootstrap.servers

(required) bootstrap.servers configuration property of the Kafka consumers used on the driver and executors

Default: (empty)

kafkaConsumer.pollTimeoutMs

The time (in milliseconds) spent waiting in Consumer.poll when data is not available in the buffer

Default: spark.network.timeout or 120s

Used when…FIXME

maxOffsetsPerTrigger

Number of records to fetch per trigger (to limit the number of records to fetch)

Default: (undefined)

Unless defined, KafkaSource requests KafkaOffsetReader for the latest offsets.

minPartitions

Minimum number of partitions per executor (given Kafka partitions)

Default: (undefined)

Must be undefined (default) or greater than 0

When undefined (default) or smaller than the number of TopicPartitions with records to consume from, KafkaMicroBatchReader uses KafkaOffsetRangeCalculator to find the preferred executor for every TopicPartition (and the available executors).

startingOffsets

Starting offsets

Default: latest

Possible values:

latest

earliest

JSON with topics, partitions and their starting offsets, e.g.

{"topicA":{"part":offset,"p1":-1},"topicB":{"0":-2}}

Tip: Use Scala's triple quotes for the JSON for topics, partitions and offsets.

option(
  "startingOffsets",
  """{"topic1":{"0":5,"4":-1},"topic2":{"0":-2}}""")

subscribe

Topic subscription strategy that accepts topic names as a comma-separated string, e.g.

topic1,topic2,topic3

Note: Exactly one topic subscription strategy is allowed (that KafkaSourceProvider validates before creating KafkaSource).

subscribepattern

Topic subscription strategy that uses Java's java.util.regex.Pattern for the topic subscription regex pattern of topics to subscribe to, e.g.

topic\d

Tip: Use Scala's triple quotes for the topic subscription regex pattern.

option("subscribepattern", """topic\d""")

Note: Exactly one topic subscription strategy is allowed (that KafkaSourceProvider validates before creating KafkaSource).

topic

Optional topic name to use for writing a streaming query

Default: (empty)

Unless defined, Kafka data source uses the topic names as defined in the topic field in the incoming data.
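
A sketch that combines the assign subscription strategy with startingOffsets and maxOffsetsPerTrigger, using triple-quoted strings for the JSON values (the topic names, partitions and offsets below are made up for the example; -1 means latest and -2 means earliest, as above):

val records = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", ":9092")
  .option("assign", """{"topicA":[0,1],"topicB":[0,1]}""")
  .option("startingOffsets", """{"topicA":{"0":5,"1":-1},"topicB":{"0":-2,"1":-2}}""")
  .option("maxOffsetsPerTrigger", 1000)
  .load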

Logical Query Plan for Reading


When DataStreamReader is requested to load a dataset with kafka data source format, it
creates a DataFrame with a StreamingRelationV2 leaf logical operator.


scala> records.explain(extended = true)


== Parsed Logical Plan ==
StreamingRelationV2 org.apache.spark.sql.kafka010.KafkaSourceProvider@1a366d0, kafka,
Map(maxOffsetsPerTrigger -> 1, startingOffsets -> latest, subscribepattern -> topic\d,
kafka.bootstrap.servers -> :9092), [key#7, value#8, topic#9, partition#10, offset#11L
, timestamp#12, timestampType#13], StreamingRelation DataSource(org.apache.spark.sql.S
parkSession@39b3de87,kafka,List(),None,List(),None,Map(maxOffsetsPerTrigger -> 1, star
tingOffsets -> latest, subscribepattern -> topic\d, kafka.bootstrap.servers -> :9092),
None), kafka, [key#0, value#1, topic#2, partition#3, offset#4L, timestamp#5, timestamp
Type#6]
...

Logical Query Plan for Writing


When DataStreamWriter is requested to start a streaming query with kafka data source
format for writing, it requests the StreamingQueryManager to create a streaming query that in
turn creates (a StreamingQueryWrapper with) a ContinuousExecution or a
MicroBatchExecution for continuous and micro-batch stream processing, respectively.

scala> sq.explain(extended = true)


== Parsed Logical Plan ==
WriteToDataSourceV2 org.apache.spark.sql.execution.streaming.sources.MicroBatchWriter@
bf98b73
+- Project [key#28 AS key#7, value#29 AS value#8, topic#30 AS topic#9, partition#31 AS
partition#10, offset#32L AS offset#11L, timestamp#33 AS timestamp#12, timestampType#34
AS timestampType#13]
+- Streaming RelationV2 kafka[key#28, value#29, topic#30, partition#31, offset#32L,
timestamp#33, timestampType#34] (Options: [subscribepattern=kafka2console.*,kafka.boo
tstrap.servers=:9092])

Demo: Streaming Aggregation with Kafka Data Source


Check out Demo: Streaming Aggregation with Kafka Data Source.

Use the following to publish events to Kafka.

// 1st streaming batch


$ cat /tmp/1
1,1,1
15,2,1
Tip
$ kafkacat -P -b localhost:9092 -t topic1 -l /tmp/1

// Alternatively (and slower due to JVM bootup)


$ cat /tmp/1 | ./bin/kafka-console-producer.sh --topic topic1 --broker-list localhost:9092


KafkaSourceProvider — Data Source Provider


for Apache Kafka
KafkaSourceProvider is a DataSourceRegister and registers a developer-friendly alias for

kafka data source format in Spark Structured Streaming.

Tip Read up on DataSourceRegister in The Internals of Spark SQL book.

KafkaSourceProvider supports micro-batch stream processing (through

MicroBatchReadSupport contract) and creates a specialized KafkaMicroBatchReader.

KafkaSourceProvider requires the following options (that you can set using option method

of DataStreamReader or DataStreamWriter):

Exactly one of the following options: subscribe, subscribePattern or assign

kafka.bootstrap.servers

Tip Refer to Kafka Data Source’s Options for the supported configuration options.

Internally, KafkaSourceProvider sets the properties for Kafka Consumers on executors (that
are passed on to InternalKafkaConsumer when requested to create a Kafka consumer with a
single TopicPartition manually assigned).

Table 1. KafkaSourceProvider's Properties for Kafka Consumers on Executors

KEY_DESERIALIZER_CLASS_CONFIG: ByteArrayDeserializer

VALUE_DESERIALIZER_CLASS_CONFIG: ByteArrayDeserializer

AUTO_OFFSET_RESET_CONFIG: none

GROUP_ID_CONFIG: uniqueGroupId-executor

ENABLE_AUTO_COMMIT_CONFIG: false

RECEIVE_BUFFER_CONFIG: 65536 (only when not set in the specifiedKafkaParams already)


Enable ALL logging levels for


org.apache.spark.sql.kafka010.KafkaSourceProvider logger to see what happens
inside.

Add the following line to conf/log4j.properties :


Tip
log4j.logger.org.apache.spark.sql.kafka010.KafkaSourceProvider=ALL

Refer to Logging.

Creating Streaming Source —  createSource Method

createSource(
sqlContext: SQLContext,
metadataPath: String,
schema: Option[StructType],
providerName: String,
parameters: Map[String, String]): Source

createSource is part of the StreamSourceProvider Contract to create a


Note
streaming source for a format or system (to continually read data).

createSource first validates stream options.

createSource …​FIXME

Validating General Options For Batch And Streaming


Queries —  validateGeneralOptions Internal Method

validateGeneralOptions(parameters: Map[String, String]): Unit

Note Parameters are case-insensitive, i.e. OptioN and option are equal.

validateGeneralOptions makes sure that exactly one topic subscription strategy is used in

parameters and can be:

subscribe

subscribepattern

assign


validateGeneralOptions reports an IllegalArgumentException when there is no subscription

strategy in use or there are more than one strategies used.

validateGeneralOptions makes sure that the value of subscription strategies meet the

requirements:

assign strategy starts with { (the opening curly brace)

subscribe strategy has at least one topic (in a comma-separated list of topics)

subscribepattern strategy has the pattern defined

validateGeneralOptions makes sure that group.id has not been specified and reports an

IllegalArgumentException otherwise.

Kafka option 'group.id' is not supported as user-specified consumer groups are not use
d to track offsets.

validateGeneralOptions makes sure that auto.offset.reset has not been specified and

reports an IllegalArgumentException otherwise.

Kafka option 'auto.offset.reset' is not supported.


Instead set the source option 'startingoffsets' to 'earliest' or
'latest' to specify where to start. Structured Streaming manages
which offsets are consumed internally, rather than relying on
the kafkaConsumer to do it. This will ensure that no data is
missed when new topics/partitions are dynamically subscribed.
Note that 'startingoffsets' only applies when a new Streaming
query is started, and
that resuming will always pick up from where the query left off.
See the docs for more details.

validateGeneralOptions makes sure that the following options have not been specified and

reports an IllegalArgumentException otherwise:

kafka.key.deserializer

kafka.value.deserializer

kafka.enable.auto.commit

kafka.interceptor.classes

In the end, validateGeneralOptions makes sure that kafka.bootstrap.servers option was


specified and reports an IllegalArgumentException otherwise.


Option 'kafka.bootstrap.servers' must be specified for configuring Kafka consumer

validateGeneralOptions is used when KafkaSourceProvider validates options


Note
for streaming and batch queries.
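
A sketch of options that pass validateGeneralOptions: exactly one subscription strategy, kafka.bootstrap.servers set, and startingOffsets used instead of the unsupported auto.offset.reset and group.id consumer properties (the broker address and topic names are examples):

val valid = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "topic1,topic2")
  .option("startingOffsets", "earliest")
  .load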

Creating ConsumerStrategy —  strategy Internal Method

strategy(caseInsensitiveParams: Map[String, String])

Internally, strategy finds the keys in the input caseInsensitiveParams that are one of the
following and creates a corresponding ConsumerStrategy.

Table 2. KafkaSourceProvider.strategy’s Key to ConsumerStrategy Conversion


Key ConsumerStrategy
AssignStrategy with Kafka’s TopicPartitions.

strategy uses JsonUtils.partitions method to parse a

assign JSON with topic names and partitions, e.g.

{"topicA":[0,1],"topicB":[0,1]}

The topic names and partitions are mapped directly to


Kafka’s TopicPartition objects.

SubscribeStrategy with topic names

subscribe strategy extracts topic names from a comma-separated


string, e.g.

topic1,topic2,topic3

SubscribePatternStrategy with topic subscription regex


pattern (that uses Java’s java.util.regex.Pattern for the
pattern), e.g.
subscribepattern

topic\d


strategy is used when:

KafkaSourceProvider creates a KafkaOffsetReader for KafkaSource.


Note
KafkaSourceProvider creates a KafkaRelation (using createRelation
method).

Describing Streaming Source with Name and Schema 


—  sourceSchema Method

sourceSchema(
sqlContext: SQLContext,
schema: Option[StructType],
providerName: String,
parameters: Map[String, String]): (String, StructType)

sourceSchema is part of the StreamSourceProvider Contract to describe a


Note
streaming source with a name and the schema.

sourceSchema gives the short name (i.e. kafka ) and the fixed schema.

Internally, sourceSchema validates Kafka options and makes sure that the optional input
schema is indeed undefined.

When the input schema is defined, sourceSchema reports a IllegalArgumentException .

Kafka source has a fixed schema and cannot be set with a custom one

Validating Kafka Options for Streaming Queries 


—  validateStreamOptions Internal Method

validateStreamOptions(caseInsensitiveParams: Map[String, String]): Unit

Firstly, validateStreamOptions makes sure that endingoffsets option is not used.


Otherwise, validateStreamOptions reports a IllegalArgumentException .

ending offset not valid in streaming queries

validateStreamOptions then validates the general options.

validateStreamOptions is used when KafkaSourceProvider is requested the


Note
schema for Kafka source and to create a KafkaSource.


Creating ContinuousReader for Continuous Stream


Processing —  createContinuousReader Method

createContinuousReader(
schema: Optional[StructType],
metadataPath: String,
options: DataSourceOptions): KafkaContinuousReader

createContinuousReader is part of the ContinuousReadSupport Contract to


Note
create a ContinuousReader.

createContinuousReader …​FIXME

Converting Configuration Options to


KafkaOffsetRangeLimit —  getKafkaOffsetRangeLimit
Object Method

getKafkaOffsetRangeLimit(
params: Map[String, String],
offsetOptionKey: String,
defaultOffsets: KafkaOffsetRangeLimit): KafkaOffsetRangeLimit

getKafkaOffsetRangeLimit finds the given offsetOptionKey in the params and does the

following conversion:

latest becomes LatestOffsetRangeLimit

earliest becomes EarliestOffsetRangeLimit

A JSON-formatted text becomes SpecificOffsetRangeLimit

When the given offsetOptionKey is not found, getKafkaOffsetRangeLimit returns the


given defaultOffsets

getKafkaOffsetRangeLimit is used when KafkaSourceProvider is requested to


Note createSource, createMicroBatchReader, createContinuousReader,
createRelation, and validateBatchOptions.
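
A standalone sketch of the conversion rules described above; the case classes here are simplified stand-ins for Spark's internal KafkaOffsetRangeLimit types, and the JSON parsing of per-partition offsets is omitted:

sealed trait OffsetRangeLimit
case object LatestOffsetRangeLimit extends OffsetRangeLimit
case object EarliestOffsetRangeLimit extends OffsetRangeLimit
case class SpecificOffsetRangeLimit(json: String) extends OffsetRangeLimit

def getKafkaOffsetRangeLimit(
    params: Map[String, String],
    offsetOptionKey: String,
    defaultOffsets: OffsetRangeLimit): OffsetRangeLimit =
  params.get(offsetOptionKey).map(_.trim) match {
    case Some(offset) if offset.equalsIgnoreCase("latest") => LatestOffsetRangeLimit
    case Some(offset) if offset.equalsIgnoreCase("earliest") => EarliestOffsetRangeLimit
    case Some(json) => SpecificOffsetRangeLimit(json) // JSON-formatted per-partition offsets
    case None => defaultOffsets
  }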

Creating MicroBatchReader for Micro-Batch Stream


Processing —  createMicroBatchReader Method


createMicroBatchReader(
schema: Optional[StructType],
metadataPath: String,
options: DataSourceOptions): KafkaMicroBatchReader

createMicroBatchReader is part of the MicroBatchReadSupport Contract to


Note
create a MicroBatchReader in Micro-Batch Stream Processing.

createMicroBatchReader validateStreamOptions (in the given DataSourceOptions ).

createMicroBatchReader generates a unique group ID of the format spark-kafka-source-

[randomUUID]-[metadataPath_hashCode] (to make sure that a new streaming query


creates a new consumer group).

createMicroBatchReader finds all the parameters (in the given DataSourceOptions ) that start

with kafka. prefix, removes the prefix, and creates the current Kafka parameters.

createMicroBatchReader creates a KafkaOffsetReader with the following:

strategy (in the given DataSourceOptions )

Properties for Kafka consumers on the driver (given the current Kafka parameters, i.e.
without kafka. prefix)

The given DataSourceOptions

spark-kafka-source-[randomUUID]-[metadataPath_hashCode]-driver for the


driverGroupIdPrefix

In the end, createMicroBatchReader creates a KafkaMicroBatchReader with the following:

the KafkaOffsetReader

Properties for Kafka consumers on executors (given the current Kafka parameters, i.e.
without kafka. prefix) and the unique group ID ( spark-kafka-source-[randomUUID]-
[metadataPath_hashCode]-driver )

The given DataSourceOptions and the metadataPath

Starting stream offsets (startingOffsets option with the default of


LatestOffsetRangeLimit offsets)

failOnDataLoss configuration property

Creating BaseRelation —  createRelation Method


createRelation(
sqlContext: SQLContext,
parameters: Map[String, String]): BaseRelation

createRelation is part of the RelationProvider contract to create a


Note
BaseRelation .

createRelation …​FIXME

Validating Configuration Options for Batch Processing 


—  validateBatchOptions Internal Method

validateBatchOptions(caseInsensitiveParams: Map[String, String]): Unit

validateBatchOptions …​FIXME

validateBatchOptions is used exclusively when KafkaSourceProvider is


Note
requested to createSource.

kafkaParamsForDriver Method

kafkaParamsForDriver(specifiedKafkaParams: Map[String, String]): Map[String, Object]

kafkaParamsForDriver …​FIXME

Note kafkaParamsForDriver is used when…​FIXME

kafkaParamsForExecutors Method

kafkaParamsForExecutors(
specifiedKafkaParams: Map[String, String],
uniqueGroupId: String): Map[String, Object]

kafkaParamsForExecutors sets the Kafka properties for executors.

While setting the properties, kafkaParamsForExecutors prints out the following DEBUG
message to the logs:

executor: Set [key] to [value], earlier value: [value]


kafkaParamsForExecutors is used when:

KafkaSourceProvider is requested to createSource (for a KafkaSource),

Note createMicroBatchReader (for a KafkaMicroBatchReader), and


createContinuousReader (for a KafkaContinuousReader)
KafkaRelation is requested to buildScan (for a KafkaSourceRDD )

Looking Up failOnDataLoss Configuration Property 


—  failOnDataLoss Internal Method

failOnDataLoss(caseInsensitiveParams: Map[String, String]): Boolean

failOnDataLoss simply looks up the failOnDataLoss configuration property in the given

caseInsensitiveParams (in case-insensitive manner) or defaults to true .

failOnDataLoss is used when KafkaSourceProvider is requested to


createSource (for a KafkaSource), createMicroBatchReader (for a
Note
KafkaMicroBatchReader), createContinuousReader (for a
KafkaContinuousReader), and createRelation (for a KafkaRelation).


KafkaSource
KafkaSource is a streaming source that generates DataFrames of records from one or more

topics in Apache Kafka.

Kafka topics are checked for new records every trigger and so there is some
Note noticeable delay between when the records have arrived to Kafka topics and
when a Spark application processes them.

KafkaSource uses the streaming metadata log directory to persist offsets. The directory is

the source ID under the sources directory in the checkpointRoot (of the StreamExecution).

Note The checkpointRoot directory is one of the following:

checkpointLocation option

spark.sql.streaming.checkpointLocation configuration property

KafkaSource is created for kafka format (that is registered by KafkaSourceProvider).

Figure 1. KafkaSource Is Created for kafka Format by KafkaSourceProvider


KafkaSource uses a predefined (fixed) schema (that cannot be changed).

KafkaSource also supports batch Datasets.

Tip Enable ALL logging level for org.apache.spark.sql.kafka010.KafkaSource to see what happens inside.

Add the following line to conf/log4j.properties :

log4j.logger.org.apache.spark.sql.kafka010.KafkaSource=ALL

Refer to Logging.

Creating KafkaSource Instance


KafkaSource takes the following to be created:

SQLContext

KafkaOffsetReader


Kafka parameters for executors (used to read records from Kafka)

Collection of key-value options

Streaming metadata log directory, i.e. the directory for streaming metadata log
(where KafkaSource persists KafkaSourceOffset offsets in JSON format)

Starting offsets (as defined using startingOffsets option)

Flag used when creating KafkaSourceRDDs (every trigger) and when checking whether to report an IllegalStateException on data loss.

KafkaSource initializes the internal properties.

Generating Streaming DataFrame with Records From Kafka for Streaming Micro-Batch —  getBatch Method

getBatch(
start: Option[Offset],
end: Offset): DataFrame

Note getBatch is part of the Source Contract to generate a streaming DataFrame with data between the start and end offsets.

getBatch creates a streaming DataFrame with a query plan with LogicalRDD logical

operator to scan data from a KafkaSourceRDD.

Internally, getBatch initializes initial partition offsets (unless initialized already).

You should see the following INFO message in the logs:

GetBatch called with start = [start], end = [end]

getBatch requests KafkaSourceOffset for end partition offsets for the input end offset

(known as untilPartitionOffsets ).

getBatch requests KafkaSourceOffset for start partition offsets for the input start offset (if

defined) or uses initial partition offsets (known as fromPartitionOffsets ).

getBatch finds the new partitions (as the difference between the topic partitions in

untilPartitionOffsets and fromPartitionOffsets ) and requests KafkaOffsetReader to

fetch their earliest offsets.

getBatch reports a data loss if the new partitions don’t match to what KafkaOffsetReader

fetched.


Cannot find earliest offsets of [partitions]. Some data may have been missed

You should see the following INFO message in the logs:

Partitions added: [newPartitionOffsets]

getBatch reports a data loss if any of the new partitions does not start from offset 0 .

Added partition [partition] starts from [offset] instead of 0. Some data may have been
missed

getBatch reports a data loss if the fromPartitionOffsets partitions differ from

untilPartitionOffsets partitions.

[partitions] are gone. Some data may have been missed

You should see the following DEBUG message in the logs:

TopicPartitions: [topicPartitions]

getBatch gets the executors (sorted by executorId and host of the registered block

managers).

Important That is when getBatch goes very low-level to allow for cached KafkaConsumers in the executors to be re-used to read the same partition in every batch (aka location preference).

You should see the following DEBUG message in the logs:

Sorted executors: [sortedExecutors]

getBatch creates a KafkaSourceRDDOffsetRange per TopicPartition .

getBatch filters out KafkaSourceRDDOffsetRanges for which until offsets are smaller than

from offsets. getBatch reports a data loss if they are found.

Partition [topicPartition]'s offset was changed from [fromOffset] to [untilOffset], some data may have been missed

getBatch creates a KafkaSourceRDD (with executorKafkaParams, pollTimeoutMs and

reuseKafkaConsumer flag enabled) and maps it to an RDD of InternalRow .


Important getBatch creates a KafkaSourceRDD with the reuseKafkaConsumer flag enabled.

You should see the following INFO message in the logs:

GetBatch generating RDD of offset range: [offsetRanges]

getBatch sets currentPartitionOffsets if it was empty (which is when…​FIXME)

In the end, getBatch creates a streaming DataFrame for the KafkaSourceRDD and the
schema.

Fetching Offsets (From Metadata Log or Kafka Directly) —  getOffset Method

getOffset: Option[Offset]

Note getOffset is a part of the Source Contract.

Internally, getOffset fetches the initial partition offsets (from the metadata log or Kafka
directly).

Figure 2. KafkaSource Initializing initialPartitionOffsets While Fetching Initial Offsets


Note initialPartitionOffsets is a lazy value and is initialized the very first time getOffset is called (which is when StreamExecution constructs a streaming micro-batch).

scala> spark.version
res0: String = 2.3.0-SNAPSHOT

// Case 1: Checkpoint directory undefined


// initialPartitionOffsets read from Kafka directly
val records = spark.
readStream.
format("kafka").
option("subscribe", "topic1").
option("kafka.bootstrap.servers", "localhost:9092").
load
// Start the streaming query


// dump records to the console every 10 seconds


import org.apache.spark.sql.streaming.{OutputMode, Trigger}
import scala.concurrent.duration._
val q = records.
writeStream.
format("console").
option("truncate", false).
trigger(Trigger.ProcessingTime(10.seconds)).
outputMode(OutputMode.Update).
start
// Note the temporary checkpoint directory
17/08/07 11:09:29 INFO StreamExecution: Starting [id = 75dd261d-6b62-40fc-a368-9d95d3c
b6f5f, runId = f18a5eb5-ccab-4d9d-8a81-befed41a72bd] with file:///private/var/folders/
0w/kb0d3rqn4zb9fcc91pxhgn8w0000gn/T/temporary-d0055630-24e4-4d9a-8f36-7a12a0f11bc0 to
store the query checkpoint.
...
INFO KafkaSource: Initial offsets: {"topic1":{"0":1}}

// Stop the streaming query


q.stop

// Case 2: Checkpoint directory defined


// initialPartitionOffsets read from Kafka directly
// since the checkpoint directory is not available yet
// it will be the next time the query is started
val records = spark.
readStream.
format("kafka").
option("subscribe", "topic1").
option("kafka.bootstrap.servers", "localhost:9092").
load.
select($"value" cast "string", $"topic", $"partition", $"offset")
import org.apache.spark.sql.streaming.{OutputMode, Trigger}
import scala.concurrent.duration._
val q = records.
writeStream.
format("console").
option("truncate", false).
option("checkpointLocation", "/tmp/checkpoint"). // <-- checkpoint directory
trigger(Trigger.ProcessingTime(10.seconds)).
outputMode(OutputMode.Update).
start
// Note the checkpoint directory in use
17/08/07 11:21:25 INFO StreamExecution: Starting [id = b8f59854-61c1-4c2f-931d-62bbaf9
0ee3b, runId = 70d06a3b-f2b1-4fa8-a518-15df4cf59130] with file:///tmp/checkpoint to st
ore the query checkpoint.
...
INFO KafkaSource: Initial offsets: {"topic1":{"0":1}}
...
INFO StreamExecution: Stored offsets for batch 0. Metadata OffsetSeqMetadata(0,1502098
526848,Map(spark.sql.shuffle.partitions -> 200, spark.sql.streaming.stateStore.provide
rClass -> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider)
)


// Review the checkpoint location


// $ ls -ltr /tmp/checkpoint/offsets
// total 8
// -rw-r--r-- 1 jacek wheel 248 7 sie 11:21 0
// $ tail -2 /tmp/checkpoint/offsets/0 | jq

// Produce messages to Kafka so the latest offset changes


// And more importantly the offset gets stored to checkpoint location
-------------------------------------------
Batch: 1
-------------------------------------------
+---------------------------+------+---------+------+
|value |topic |partition|offset|
+---------------------------+------+---------+------+
|testing checkpoint location|topic1|0 |2 |
+---------------------------+------+---------+------+

// and one more


// Note the offset
-------------------------------------------
Batch: 2
-------------------------------------------
+------------+------+---------+------+
|value |topic |partition|offset|
+------------+------+---------+------+
|another test|topic1|0 |3 |
+------------+------+---------+------+

// See what was checkpointed


// $ ls -ltr /tmp/checkpoint/offsets
// total 24
// -rw-r--r-- 1 jacek wheel 248 7 sie 11:35 0
// -rw-r--r-- 1 jacek wheel 248 7 sie 11:37 1
// -rw-r--r-- 1 jacek wheel 248 7 sie 11:38 2
// $ tail -2 /tmp/checkpoint/offsets/2 | jq

// Stop the streaming query


q.stop

// And start over to see what offset the query starts from
// Checkpoint location should have the offsets
val q = records.
writeStream.
format("console").
option("truncate", false).
option("checkpointLocation", "/tmp/checkpoint"). // <-- checkpoint directory
trigger(Trigger.ProcessingTime(10.seconds)).
outputMode(OutputMode.Update).
start
// Whoops...console format does not support recovery (!)
// Reported as https://issues.apache.org/jira/browse/SPARK-21667
org.apache.spark.sql.AnalysisException: This query does not support recovering from checkpoint location. Delete /tmp/checkpoint/offsets to start over.;


at org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryMa
nager.scala:222)
at org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryMan
ager.scala:278)
at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:284)
... 61 elided

// Change the sink (= output format) to JSON


val q = records.
writeStream.
format("json").
option("path", "/tmp/json-sink").
option("checkpointLocation", "/tmp/checkpoint"). // <-- checkpoint directory
trigger(Trigger.ProcessingTime(10.seconds)).
start
// Note the checkpoint directory in use
17/08/07 12:09:02 INFO StreamExecution: Starting [id = 02e00924-5f0d-4501-bcb8-80be8a8
be385, runId = 5eba2576-dad6-4f95-9031-e72514475edc] with file:///tmp/checkpoint to st
ore the query checkpoint.
...
17/08/07 12:09:02 INFO KafkaSource: GetBatch called with start = Some({"topic1":{"0":3
}}), end = {"topic1":{"0":4}}
17/08/07 12:09:02 INFO KafkaSource: Partitions added: Map()
17/08/07 12:09:02 DEBUG KafkaSource: TopicPartitions: topic1-0
17/08/07 12:09:02 DEBUG KafkaSource: Sorted executors:
17/08/07 12:09:02 INFO KafkaSource: GetBatch generating RDD of offset range: KafkaSour
ceRDDOffsetRange(topic1-0,3,4,None)
17/08/07 12:09:03 DEBUG KafkaOffsetReader: Partitions assigned to consumer: [topic1-0]
. Seeking to the end.
17/08/07 12:09:03 DEBUG KafkaOffsetReader: Got latest offsets for partition : Map(topi
c1-0 -> 4)
17/08/07 12:09:03 DEBUG KafkaSource: GetOffset: ArrayBuffer((topic1-0,4))
17/08/07 12:09:03 DEBUG StreamExecution: getOffset took 122 ms
17/08/07 12:09:03 DEBUG StreamExecution: Resuming at batch 3 with committed offsets {K
afkaSource[Subscribe[topic1]]: {"topic1":{"0":4}}} and available offsets {KafkaSource[
Subscribe[topic1]]: {"topic1":{"0":4}}}
17/08/07 12:09:03 DEBUG StreamExecution: Stream running from {KafkaSource[Subscribe[to
pic1]]: {"topic1":{"0":4}}} to {KafkaSource[Subscribe[topic1]]: {"topic1":{"0":4}}}

getOffset requests KafkaOffsetReader to fetchLatestOffsets (known later as latest ).

Note (Possible performance degradation?) It is possible that getOffset will request the latest offsets from Kafka twice, i.e. while initializing initialPartitionOffsets (when no metadata log is available and KafkaSource’s KafkaOffsetRangeLimit is LatestOffsetRangeLimit ) and always as part of getOffset itself.

getOffset then calculates currentPartitionOffsets based on the maxOffsetsPerTrigger

option.


Table 1. getOffset’s Offset Calculation per maxOffsetsPerTrigger

Unspecified (i.e. None ): latest

Defined (but currentPartitionOffsets is empty): rateLimit with limit limit, initialPartitionOffsets as from , latest as until

Defined (and currentPartitionOffsets contains partitions and offsets): rateLimit with limit limit, currentPartitionOffsets as from , latest as until

You should see the following DEBUG message in the logs:

DEBUG KafkaSource: GetOffset: [offsets]

In the end, getOffset creates a KafkaSourceOffset with offsets (as Map[TopicPartition,


Long] ).
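As an example, the maxOffsetsPerTrigger option caps how many records a single micro-batch reads (a sketch; the topic and servers are placeholders):

val limited = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "topic1")
  .option("maxOffsetsPerTrigger", 10000) // rateLimit spreads at most 10000 records across the partitions per batch
  .load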

Fetching and Verifying Specific Offsets —  fetchAndVerify Internal Method

fetchAndVerify(specificOffsets: Map[TopicPartition, Long]): KafkaSourceOffset

fetchAndVerify requests KafkaOffsetReader to fetchSpecificOffsets for the given

specificOffsets .

fetchAndVerify makes sure that the starting offsets in specificOffsets are the same as in

Kafka and reports a data loss otherwise.

startingOffsets for [tp] was [off] but consumer reset to [result(tp)]

In the end, fetchAndVerify creates a KafkaSourceOffset (with the result of KafkaOffsetReader).

Note fetchAndVerify is used exclusively when KafkaSource initializes initial partition offsets.

Initial Partition Offsets (of 0th Batch) —  initialPartitionOffsets Internal Lazy Property


initialPartitionOffsets: Map[TopicPartition, Long]

initialPartitionOffsets is the initial partition offsets for the batch 0 that were already

persisted in the streaming metadata log directory or persisted on demand.

As the very first step, initialPartitionOffsets creates a custom HDFSMetadataLog (of


KafkaSourceOffsets metadata) in the streaming metadata log directory.

initialPartitionOffsets requests the HDFSMetadataLog for the metadata of the 0 th batch

(as KafkaSourceOffset ).

If the metadata is available, initialPartitionOffsets requests the metadata for the


collection of TopicPartitions and their offsets.

If the metadata could not be found, initialPartitionOffsets creates a new


KafkaSourceOffset per KafkaOffsetRangeLimit:

For EarliestOffsetRangeLimit , initialPartitionOffsets requests the KafkaOffsetReader to fetchEarliestOffsets

For LatestOffsetRangeLimit , initialPartitionOffsets requests the KafkaOffsetReader to fetchLatestOffsets

For SpecificOffsetRangeLimit , initialPartitionOffsets requests the KafkaOffsetReader to fetchSpecificOffsets (and reports a data loss per the failOnDataLoss flag)

initialPartitionOffsets requests the custom HDFSMetadataLog to add the offsets to the

metadata log (as the metadata of the 0 th batch).

initialPartitionOffsets prints out the following INFO message to the logs:

Initial offsets: [offsets]

Note initialPartitionOffsets is used when KafkaSource is requested for the following:

Fetch offsets (from metadata log or Kafka directly)

Generate a DataFrame with records from Kafka for a streaming batch (when the start offsets are not defined, i.e. before StreamExecution commits the first streaming batch and so nothing is in the committedOffsets registry for a KafkaSource data source yet)

HDFSMetadataLog.serialize


serialize(
metadata: KafkaSourceOffset,
out: OutputStream): Unit

Note serialize is part of the HDFSMetadataLog Contract to…​FIXME.

serialize requests the OutputStream to write a zero byte (to support Spark 2.1.0 as per

SPARK-19517).

serialize creates a BufferedWriter over an OutputStreamWriter over the OutputStream (with UTF_8 charset encoding).

serialize requests the BufferedWriter to write the v1 version indicator followed by a new

line.

serialize then requests the KafkaSourceOffset for a JSON-serialized representation and

the BufferedWriter to write it out.

In the end, serialize requests the BufferedWriter to flush (the underlying stream).
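A minimal, standalone sketch of that on-disk layout (writeInitialOffset and offsetsJson are hypothetical names used only for illustration):

import java.io.{BufferedWriter, OutputStream, OutputStreamWriter}
import java.nio.charset.StandardCharsets.UTF_8

def writeInitialOffset(out: OutputStream, offsetsJson: String): Unit = {
  out.write(0) // a zero byte first (Spark 2.1.0 compatibility, SPARK-19517)
  val writer = new BufferedWriter(new OutputStreamWriter(out, UTF_8))
  writer.write("v1")        // version indicator
  writer.write("\n")        // followed by a new line
  writer.write(offsetsJson) // JSON-serialized KafkaSourceOffset, e.g. {"topic1":{"0":1}}
  writer.flush()            // flush the underlying stream
}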

rateLimit Internal Method

rateLimit(
limit: Long,
from: Map[TopicPartition, Long],
until: Map[TopicPartition, Long]): Map[TopicPartition, Long]

rateLimit requests KafkaOffsetReader to fetchEarliestOffsets.

Caution FIXME

Note rateLimit is used exclusively when KafkaSource gets available offsets (when the maxOffsetsPerTrigger option is specified).

getSortedExecutorList Method

Caution FIXME

reportDataLoss Internal Method

Caution FIXME


Note reportDataLoss is used when KafkaSource does the following:

fetches and verifies specific offsets

generates a DataFrame with records from Kafka for a batch

Internal Properties

currentPartitionOffsets: Current partition offsets (as Map[TopicPartition, Long] ). Initially NONE and set when KafkaSource is requested to get the maximum available offsets or generate a DataFrame with records from Kafka for a batch.

pollTimeoutMs

sc: Spark Core’s SparkContext (of the SQLContext). Used when generating a DataFrame with records from Kafka for a streaming micro-batch (and creating a KafkaSourceRDD), and when initializing the pollTimeoutMs internal property.


KafkaRelation
KafkaRelation represents a collection of rows with a predefined schema ( BaseRelation )

that supports column pruning ( TableScan ).

Tip Read up on BaseRelation and TableScan in The Internals of Spark SQL online book.

KafkaRelation is created exclusively when KafkaSourceProvider is requested to create a

BaseRelation.

Table 1. KafkaRelation’s Options

kafkaConsumer.pollTimeoutMs: Default: spark.network.timeout configuration if set or 120s

Tip Enable ALL logging levels for org.apache.spark.sql.kafka010.KafkaRelation to see what happens inside.

Add the following line to conf/log4j.properties :

log4j.logger.org.apache.spark.sql.kafka010.KafkaRelation=ALL

Refer to Logging.

Creating KafkaRelation Instance


KafkaRelation takes the following when created:

SQLContext

ConsumerStrategy

Source options ( Map[String, String] )

User-defined Kafka parameters ( Map[String, String] )

failOnDataLoss flag

Starting offsets

Ending offsets


getPartitionOffsets Internal Method

getPartitionOffsets(
kafkaReader: KafkaOffsetReader,
kafkaOffsets: KafkaOffsetRangeLimit): Map[TopicPartition, Long]

Caution FIXME

Note getPartitionOffsets is used exclusively when KafkaRelation builds RDD of rows (from the tuples).

Building Distributed Data Scan with Column Pruning —  buildScan Method

buildScan(): RDD[Row]

Note buildScan is part of the TableScan contract to build a distributed data scan with column pruning.

buildScan generates a unique group ID of the format spark-kafka-relation-[randomUUID] (to make sure that every query uses its own consumer group).

buildScan creates a KafkaOffsetReader with the following:

The given ConsumerStrategy and the source options

Kafka parameters for the driver based on the given specifiedKafkaParams

spark-kafka-relation-[randomUUID]-driver for the driverGroupIdPrefix

buildScan uses the KafkaOffsetReader to getPartitionOffsets for the starting and ending offsets (based on the starting and ending KafkaOffsetRangeLimits, respectively). buildScan requests the KafkaOffsetReader to close afterwards.

buildScan creates offset ranges (that are a collection of KafkaSourceRDDOffsetRanges with a

Kafka TopicPartition , beginning and ending offsets and undefined preferred location).

buildScan prints out the following INFO message to the logs:

Generating RDD of offset ranges: [offsetRanges]

buildScan creates a KafkaSourceRDD with the following:


Kafka parameters for executors based on the given specifiedKafkaParams and the
unique group ID ( spark-kafka-relation-[randomUUID] )

The offset ranges created

pollTimeoutMs configuration

The given failOnDataLoss flag

reuseKafkaConsumer flag off ( false )

buildScan requests the KafkaSourceRDD to map Kafka ConsumerRecords to InternalRows .

In the end, buildScan requests the SQLContext to create a DataFrame (with the name
kafka and the predefined schema) that is immediately converted to a RDD[InternalRow] .

buildScan throws an IllegalStateException when…​FIXME

different topic partitions for starting offsets topics[[fromTopics]] and ending offsets topics[[untilTopics]]

buildScan throws an IllegalStateException when…​FIXME

[tp] doesn't have a from offset


KafkaSourceRDD
KafkaSourceRDD is an RDD of Kafka’s ConsumerRecords ( RDD[ConsumerRecord[Array[Byte], Array[Byte]]] ) with no parent RDDs.

KafkaSourceRDD is created when:

KafkaRelation is requested to build a distributed data scan with column pruning

KafkaSource is requested to generate a streaming DataFrame with records from Kafka

for a streaming micro-batch

Creating KafkaSourceRDD Instance


KafkaSourceRDD takes the following when created:

SparkContext

Collection of key-value settings for executors reading records from Kafka topics

Collection of KafkaSourceRDDOffsetRange offsets

Timeout (in milliseconds) to poll data from Kafka

Used when KafkaSourceRDD is requested for records (for given offsets) and in turn
requests CachedKafkaConsumer to poll for Kafka’s ConsumerRecords .

Flag to…​FIXME

Flag to…​FIXME

Placement Preferences of Partition (Preferred Locations) —  getPreferredLocations Method

getPreferredLocations(
split: Partition): Seq[String]

Note getPreferredLocations is part of the RDD contract to specify placement preferences.

getPreferredLocations converts the given Partition to a KafkaSourceRDDPartition and…​

FIXME


Computing Partition —  compute Method

compute(
thePart: Partition,
context: TaskContext
): Iterator[ConsumerRecord[Array[Byte], Array[Byte]]]

Note compute is part of the RDD contract to compute a given partition.

compute uses KafkaDataConsumer utility to acquire a cached KafkaDataConsumer (for a

partition).

compute resolves the range (based on the offsetRange of the given partition that is

assumed a KafkaSourceRDDPartition ).

compute returns a NextIterator so that getNext uses the KafkaDataConsumer to get a

record.

When the beginning and ending offsets (of the offset range) are equal, compute prints out
the following INFO message to the logs, requests the KafkaDataConsumer to release and
returns an empty iterator.

Beginning offset [fromOffset] is the same as ending offset skipping [topic] [partition]

compute throws an AssertionError when the beginning offset ( fromOffset ) is after the

ending offset ( untilOffset ):

Beginning offset [fromOffset] is after the ending offset


[untilOffset] for topic [topic] partition [partition]. You
either provided an invalid fromOffset, or the Kafka topic has
been damaged

getPartitions Method

getPartitions: Array[Partition]

Note getPartitions is part of the RDD contract to…​FIXME.

getPartitions …​FIXME


Persisting RDD —  persist Method

persist(newLevel: StorageLevel): this.type

Note persist is part of the RDD contract to persist an RDD.

persist …​FIXME

resolveRange Internal Method

resolveRange(
consumer: KafkaDataConsumer,
range: KafkaSourceRDDOffsetRange
): KafkaSourceRDDOffsetRange

resolveRange …​FIXME

Note resolveRange is used when…​FIXME


CachedKafkaConsumer
Caution FIXME

poll Internal Method

Caution FIXME

fetchData Internal Method

Caution FIXME


KafkaSourceOffset
KafkaSourceOffset is a custom Offset for kafka data source.

KafkaSourceOffset is created (directly or indirectly using apply) when:

KafkaContinuousReader is requested to setStartOffset, deserializeOffset, and

mergeOffsets

KafkaMicroBatchReader is requested to getStartOffset, getEndOffset, deserializeOffset,

and getOrCreateInitialPartitionOffsets

KafkaOffsetReader is requested to fetchSpecificOffsets

KafkaSource is requested for the initial partition offsets (of 0th batch) and getOffset

KafkaSourceInitialOffsetWriter is requested to deserialize a KafkaSourceOffset (from

an InputStream)

KafkaSourceOffset is requested for partition offsets

KafkaSourceOffset takes a collection of Kafka TopicPartitions with offsets to be created.

Partition Offsets —  getPartitionOffsets Method

getPartitionOffsets(
offset: Offset): Map[TopicPartition, Long]

getPartitionOffsets takes KafkaSourceOffset.partitionToOffsets from offset .

If offset is KafkaSourceOffset , getPartitionOffsets takes the partitions and offsets


straight from it.

If however offset is SerializedOffset , getPartitionOffsets deserializes the offsets from


JSON.

getPartitionOffsets reports an IllegalArgumentException when offset is neither KafkaSourceOffset nor SerializedOffset .

Invalid conversion from offset of [class] to KafkaSourceOffset


Note getPartitionOffsets is used when:

KafkaContinuousReader is requested to planInputPartitions

KafkaSource is requested to generate a streaming DataFrame with records from Kafka for a streaming micro-batch

JSON-Encoded Offset —  json Method

json: String

Note json is part of the Offset Contract for a JSON-encoded offset.

json …​FIXME

Creating KafkaSourceOffset Instance —  apply Utility Method

apply(
offsetTuples: (String, Int, Long)*): KafkaSourceOffset (1)
apply(
offset: SerializedOffset): KafkaSourceOffset

1. Used in tests only

apply …​FIXME

Note apply is used when:

KafkaSourceInitialOffsetWriter is requested to deserialize a KafkaSourceOffset (from an InputStream)

KafkaSource is requested for the initial partition offsets (of 0th batch)

KafkaSourceOffset is requested to getPartitionOffsets


KafkaOffsetReader
KafkaOffsetReader relies on the ConsumerStrategy to create a Kafka Consumer.

KafkaOffsetReader creates a Kafka Consumer with group.id

( ConsumerConfig.GROUP_ID_CONFIG ) configuration explicitly set to nextGroupId (i.e. the given


driverGroupIdPrefix followed by nextId).

KafkaOffsetReader is created when:

KafkaRelation is requested to build a distributed data scan with column pruning

KafkaSourceProvider is requested to create a KafkaSource, createMicroBatchReader,

and createContinuousReader

Table 1. KafkaOffsetReader’s Options

fetchOffset.numRetries: Default: 3

fetchOffset.retryIntervalMs: How long to wait before retries. Default: 1000

KafkaOffsetReader defines the predefined fixed schema.

Tip Enable ALL logging level for org.apache.spark.sql.kafka010.KafkaOffsetReader to see what happens inside.

Add the following line to conf/log4j.properties :

log4j.logger.org.apache.spark.sql.kafka010.KafkaOffsetReader=ALL

Refer to Logging.

Creating KafkaOffsetReader Instance


KafkaOffsetReader takes the following to be created:

ConsumerStrategy

Kafka parameters (as name-value pairs that are used exclusively to create a Kafka consumer)

Options (as name-value pairs)


Prefix of the group ID

KafkaOffsetReader initializes the internal properties.

nextGroupId Internal Method

nextGroupId(): String

nextGroupId sets the groupId to the driverGroupIdPrefix followed by - and the nextId (i.e. [driverGroupIdPrefix]-[nextId] ).

In the end, nextGroupId increments the nextId and returns the groupId.

Note nextGroupId is used exclusively when KafkaOffsetReader is requested for a Kafka Consumer.
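A standalone sketch of the group ID scheme (GroupIdGenerator is a hypothetical name used only for illustration; the real logic lives inside KafkaOffsetReader):

class GroupIdGenerator(driverGroupIdPrefix: String) {
  private var nextId = 0
  def nextGroupId(): String = {
    val groupId = s"$driverGroupIdPrefix-$nextId" // [driverGroupIdPrefix]-[nextId]
    nextId += 1                                   // bump nextId for the next consumer
    groupId
  }
}

// new GroupIdGenerator("spark-kafka-source-uuid-1234-driver").nextGroupId()
// => spark-kafka-source-uuid-1234-driver-0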

resetConsumer Internal Method

resetConsumer(): Unit

resetConsumer …​FIXME

Note resetConsumer is used when…​FIXME

fetchTopicPartitions Method

fetchTopicPartitions(): Set[TopicPartition]

Caution FIXME

Note fetchTopicPartitions is used when KafkaRelation is requested to getPartitionOffsets.

Fetching Earliest Offsets —  fetchEarliestOffsets Method

fetchEarliestOffsets(): Map[TopicPartition, Long]


fetchEarliestOffsets(newPartitions: Seq[TopicPartition]): Map[TopicPartition, Long]

Caution FIXME


Note fetchEarliestOffsets is used when KafkaSource is requested to rateLimit and to generate a DataFrame for a batch (when new partitions have been assigned).

Fetching Latest Offsets —  fetchLatestOffsets Method

fetchLatestOffsets(): Map[TopicPartition, Long]

Caution FIXME

Note fetchLatestOffsets is used when KafkaSource gets offsets or when initialPartitionOffsets is initialized.

withRetriesWithoutInterrupt Internal Method

withRetriesWithoutInterrupt(
body: => Map[TopicPartition, Long]): Map[TopicPartition, Long]

withRetriesWithoutInterrupt …​FIXME

Note withRetriesWithoutInterrupt is used when…​FIXME

Fetching Offsets for Selected TopicPartitions —  fetchSpecificOffsets Method

fetchSpecificOffsets(
partitionOffsets: Map[TopicPartition, Long],
reportDataLoss: String => Unit): KafkaSourceOffset

Figure 1. KafkaOffsetReader’s fetchSpecificOffsets


fetchSpecificOffsets requests the Kafka Consumer to poll(0) .


fetchSpecificOffsets requests the Kafka Consumer for assigned partitions (using

Consumer.assignment() ).

fetchSpecificOffsets requests the Kafka Consumer to pause(partitions) .

You should see the following DEBUG message in the logs:

DEBUG KafkaOffsetReader: Partitions assigned to consumer: [partitions]. Seeking to [pa


rtitionOffsets]

For every partition offset in the input partitionOffsets , fetchSpecificOffsets requests the
Kafka Consumer to:

seekToEnd for the latest (aka -1 )

seekToBeginning for the earliest (aka -2 )

seek for other offsets

In the end, fetchSpecificOffsets creates a collection of Kafka’s TopicPartition and


position (using the Kafka Consumer).

Note fetchSpecificOffsets is used when KafkaSource fetches and verifies initial partition offsets.
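A sketch of that seeking logic against a plain Kafka Consumer (servers, group ID, topic, and offsets are placeholders; the real code also retries and reports data loss):

import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("group.id", "spark-kafka-source-example-driver-0")
props.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer")

val consumer = new KafkaConsumer[Array[Byte], Array[Byte]](props)
consumer.subscribe(Seq("topic1").asJava)
consumer.poll(0)                          // force the partition assignment
val partitions = consumer.assignment()
consumer.pause(partitions)                // no records are fetched, only offsets

val partitionOffsets = Map(new TopicPartition("topic1", 0) -> 23L)
partitionOffsets.foreach { case (tp, off) =>
  if (off == -1L) consumer.seekToEnd(Seq(tp).asJava)            // latest
  else if (off == -2L) consumer.seekToBeginning(Seq(tp).asJava) // earliest
  else consumer.seek(tp, off)                                   // a specific offset
}
val fetched = partitionOffsets.keys.map(tp => tp -> consumer.position(tp)).toMap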

Creating Kafka Consumer —  createConsumer Internal Method

createConsumer(): Consumer[Array[Byte], Array[Byte]]

createConsumer requests ConsumerStrategy to create a Kafka Consumer with

driverKafkaParams and new generated group.id Kafka property.

Note createConsumer is used when KafkaOffsetReader is created (and initializes the consumer) and when requested to resetConsumer.

Creating Kafka Consumer (Unless Already Available) —  consumer Method

consumer: Consumer[Array[Byte], Array[Byte]]

consumer gives the cached Kafka Consumer or creates one itself.


Note Since the consumer method is used (to access the internal Kafka Consumer) in the fetch methods, a new Kafka Consumer is created whenever the internal Kafka Consumer reference becomes null , i.e. as in resetConsumer.

consumer …​FIXME

Note consumer is used when KafkaOffsetReader is requested to fetchTopicPartitions, fetchSpecificOffsets, fetchEarliestOffsets, and fetchLatestOffsets.

Closing —  close Method

close(): Unit

close stops the Kafka Consumer (if the Kafka Consumer is available).

close requests the ExecutorService to shut down.

Note close is used when:

KafkaContinuousReader, KafkaMicroBatchReader, and KafkaSource are requested to stop a streaming reader or source

KafkaRelation is requested to build a distributed data scan with column pruning

runUninterruptibly Internal Method

runUninterruptibly[T](body: => T): T

runUninterruptibly …​FIXME

Note runUninterruptibly is used when…​FIXME

stopConsumer Internal Method

stopConsumer(): Unit

stopConsumer …​FIXME

Note stopConsumer is used when…​FIXME


Textual Representation —  toString Method

toString: String

Note toString is part of the java.lang.Object contract for the string representation of the object.

toString …​FIXME

Internal Properties

_consumer: Kafka’s Consumer ( Consumer[Array[Byte], Array[Byte]] ). Initialized when KafkaOffsetReader is created. Used when KafkaOffsetReader fetchTopicPartitions, fetches offsets for selected TopicPartitions, fetchEarliestOffsets, fetchLatestOffsets, resetConsumer, and is closed.

execContext: scala.concurrent.ExecutionContextExecutorService

groupId

kafkaReaderThread: java.util.concurrent.ExecutorService

maxOffsetFetchAttempts

nextId: Initially 0

offsetFetchAttemptIntervalMs


ConsumerStrategy Contract for KafkaConsumer Providers
ConsumerStrategy is the contract for components that can create a KafkaConsumer using

the given Kafka parameters.

createConsumer(kafkaParams: java.util.Map[String, Object]): Consumer[Array[Byte], Array[Byte]]

Table 1. Available ConsumerStrategies

AssignStrategy: Uses KafkaConsumer.assign(Collection<TopicPartition> partitions)

SubscribeStrategy: Uses KafkaConsumer.subscribe(Collection<String> topics)

SubscribePatternStrategy: Uses KafkaConsumer.subscribe(Pattern pattern, ConsumerRebalanceListener listener) with NoOpConsumerRebalanceListener . Tip Refer to java.util.regex.Pattern for the format of supported topic subscription regex patterns.
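The three strategies correspond to the three mutually-exclusive subscription options of the kafka data source (a sketch; servers and topic names are placeholders):

val byAssign = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("assign", """{"topic1":[0,1]}""")    // AssignStrategy
  .load

val bySubscribe = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "topic1,topic2")        // SubscribeStrategy
  .load

val byPattern = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribePattern", "topic.*")       // SubscribePatternStrategy
  .load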


KafkaSink
KafkaSink is a streaming sink that KafkaSourceProvider registers as the kafka format.

// start spark-shell or a Spark application with spark-sql-kafka-0-10 module


// spark-shell --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.0-SNAPSHOT
import org.apache.spark.sql.SparkSession
val spark: SparkSession = ...
spark.
readStream.
format("text").
load("server-logs/*.out").
as[String].
writeStream.
queryName("server-logs processor").
format("kafka"). // <-- uses KafkaSink
option("kafka.bootstrap.servers", "localhost:9092"). // <-- mandatory Kafka broker address
option("topic", "topic1").
option("checkpointLocation", "/tmp/kafka-sink-checkpoint"). // <-- mandatory
start

// in another terminal
$ echo hello > server-logs/hello.out

// in the terminal with Spark


FIXME

Creating KafkaSink Instance


KafkaSink takes the following when created:

SQLContext

Kafka parameters (used on executor) as a map of (String, Object) pairs

Optional topic name

addBatch Method

addBatch(batchId: Long, data: DataFrame): Unit

Internally, addBatch requests KafkaWriter to write the input data to the topic (if defined)
or a topic in executorKafkaParams.

Note addBatch is a part of Sink Contract to "add" a batch of data to the sink.



KafkaOffsetRangeLimit — Desired Offset Range Limits
KafkaOffsetRangeLimit represents the desired offset range limits for starting, ending, and

specific offsets in Kafka Data Source.

Table 1. KafkaOffsetRangeLimits

EarliestOffsetRangeLimit: Intent to bind to the earliest offset

LatestOffsetRangeLimit: Intent to bind to the latest offset

SpecificOffsetRangeLimit: Intent to bind to specific offsets with the following special offset "magic" numbers: -1 or KafkaOffsetRangeLimit.LATEST (the latest offset), -2 or KafkaOffsetRangeLimit.EARLIEST (the earliest offset)

Note KafkaOffsetRangeLimit is a Scala sealed trait which means that all the implementations are in the same compilation unit (a single file).

KafkaOffsetRangeLimit is often used in a text-based representation and is converted from latest, earliest or a JSON-formatted text using the KafkaSourceProvider.getKafkaOffsetRangeLimit object method.

A JSON-formatted text is of the following format {"topicName":


Note
{"partition":offset},…​} , e.g. {"topicA":{"0":23,"1":-1},"topicB":{"0":-2}} .

KafkaOffsetRangeLimit is used when:

KafkaContinuousReader is created (with the initial offsets)

KafkaMicroBatchReader is created (with the starting offsets)

KafkaRelation is created (with the starting and ending offsets)

KafkaSource is created (with the starting offsets)

KafkaSourceProvider is requested to convert configuration options to

KafkaOffsetRangeLimits



KafkaDataConsumer
KafkaDataConsumer is the abstraction of Kafka consumers that use an InternalKafkaConsumer and can be released.

Table 1. KafkaDataConsumer Contract (Abstract Methods Only)

internalConsumer
internalConsumer: InternalKafkaConsumer
Used when…​FIXME

release
release(): Unit
Used when…​FIXME

Table 2. KafkaDataConsumers

CachedKafkaDataConsumer

NonCachedKafkaDataConsumer

Acquiring Cached KafkaDataConsumer for Partition —  acquire Object Method

acquire(
topicPartition: TopicPartition,
kafkaParams: ju.Map[String, Object],
useCache: Boolean
): KafkaDataConsumer

acquire …​FIXME

Note acquire is used when…​FIXME

Getting Kafka Record —  get Method


get(
offset: Long,
untilOffset: Long,
pollTimeoutMs: Long,
failOnDataLoss: Boolean
): ConsumerRecord[Array[Byte], Array[Byte]]

get …​FIXME

Note get is used when…​FIXME


KafkaMicroBatchReader
KafkaMicroBatchReader is the MicroBatchReader for kafka data source for Micro-Batch

Stream Processing.

KafkaMicroBatchReader is created exclusively when KafkaSourceProvider is requested to

create a MicroBatchReader.

KafkaMicroBatchReader uses the DataSourceOptions to access the

kafkaConsumer.pollTimeoutMs option (default: spark.network.timeout or 120s ).

KafkaMicroBatchReader uses the DataSourceOptions to access the maxOffsetsPerTrigger

option (default: (undefined) ).

KafkaMicroBatchReader uses the Kafka properties for executors to create

KafkaMicroBatchInputPartitions when requested to planInputPartitions.

Tip Enable ALL logging level for org.apache.spark.sql.kafka010.KafkaMicroBatchReader to see what happens inside.

Add the following line to conf/log4j.properties :

log4j.logger.org.apache.spark.sql.kafka010.KafkaMicroBatchReader=ALL

Refer to Logging.

Creating KafkaMicroBatchReader Instance


KafkaMicroBatchReader takes the following to be created:

KafkaOffsetReader

Kafka properties for executors ( Map[String, Object] )

DataSourceOptions

Metadata Path

Desired starting KafkaOffsetRangeLimit

failOnDataLoss option

KafkaMicroBatchReader initializes the internal registries and counters.


readSchema Method

readSchema(): StructType

Note readSchema is part of the DataSourceReader contract to…​FIXME.

readSchema simply returns the predefined fixed schema.

Stopping Streaming Reader —  stop Method

stop(): Unit

Note stop is part of the BaseStreamingSource Contract to stop a streaming reader.

stop simply requests the KafkaOffsetReader to close.

Plan Input Partitions —  planInputPartitions Method

planInputPartitions(): java.util.List[InputPartition[InternalRow]]

Note planInputPartitions is part of the DataSourceReader contract in Spark SQL for the number of InputPartitions to use as RDD partitions (when DataSourceV2ScanExec physical operator is requested for the partitions of the input RDD).

planInputPartitions first finds the new partitions ( TopicPartitions that are in the

endPartitionOffsets but not in the startPartitionOffsets) and requests the KafkaOffsetReader


to fetch their earliest offsets.

planInputPartitions prints out the following INFO message to the logs:

Partitions added: [newPartitionInitialOffsets]

planInputPartitions then prints out the following DEBUG message to the logs:

TopicPartitions: [comma-separated list of TopicPartitions]

planInputPartitions requests the KafkaOffsetRangeCalculator for offset ranges (given the startPartitionOffsets and the newly-calculated newPartitionInitialOffsets as the fromOffsets , the endPartitionOffsets as the untilOffsets , and the available executors (sorted in descending order)).

In the end, planInputPartitions creates a KafkaMicroBatchInputPartition for every offset


range (with the Kafka properties for executors, the pollTimeoutMs, the failOnDataLoss flag
and whether to reuse a Kafka consumer among Spark tasks).

Note KafkaMicroBatchInputPartition uses a shared Kafka consumer only when all the offset ranges have distinct TopicPartitions , so concurrent tasks (of a stage in a Spark job) will not interfere and read the same TopicPartitions .

planInputPartitions reports data loss when…​FIXME

Available Executors in Spark Cluster (Sorted By Host and Executor ID in Descending Order) —  getSortedExecutorList Internal Method

getSortedExecutorList(): Array[String]

getSortedExecutorList requests the BlockManager to request the BlockManagerMaster to

get the peers (the other nodes in a Spark cluster), creates a ExecutorCacheTaskLocation for
every pair of host and executor ID, and in the end, sort it in descending order.

Note getSortedExecutorList is used exclusively when KafkaMicroBatchReader is requested to planInputPartitions (and calculates offset ranges).
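A standalone sketch of the sorting idea (ExecutorLocation is a stand-in for Spark's internal ExecutorCacheTaskLocation):

case class ExecutorLocation(host: String, executorId: String)

// Sort (host, executorId) pairs in descending order and render them
// the way Spark encodes executor task locations
def sortedExecutors(executors: Seq[ExecutorLocation]): Seq[String] =
  executors
    .sortWith { (a, b) =>
      if (a.host == b.host) a.executorId > b.executorId else a.host > b.host
    }
    .map(e => s"executor_${e.host}_${e.executorId}")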

getOrCreateInitialPartitionOffsets Internal
Method

getOrCreateInitialPartitionOffsets(): PartitionOffsetMap

getOrCreateInitialPartitionOffsets …​FIXME

Note getOrCreateInitialPartitionOffsets is used exclusively for the initialPartitionOffsets internal registry.

getStartOffset Method

getStartOffset: Offset

Note getStartOffset is part of the MicroBatchReader Contract to get the start (beginning) offsets.


getStartOffset …​FIXME

getEndOffset Method

getEndOffset: Offset

Note getEndOffset is part of the MicroBatchReader Contract to get the end offsets.

getEndOffset …​FIXME

deserializeOffset Method

deserializeOffset(json: String): Offset

Note deserializeOffset is part of the MicroBatchReader Contract to deserialize an offset (from JSON format).

deserializeOffset …​FIXME

Internal Properties

endPartitionOffsets: Ending offsets for the assigned partitions ( Map[TopicPartition, Long] ). Used when…​FIXME

initialPartitionOffsets: Map[TopicPartition, Long]

rangeCalculator: KafkaOffsetRangeCalculator (for the given DataSourceOptions). Used exclusively when KafkaMicroBatchReader is requested to planInputPartitions (to calculate offset ranges)

startPartitionOffsets: Starting offsets for the assigned partitions ( Map[TopicPartition, Long] ). Used when…​FIXME



KafkaOffsetRangeCalculator
KafkaOffsetRangeCalculator is created for KafkaMicroBatchReader to calculate offset

ranges (when KafkaMicroBatchReader is requested to planInputPartitions).

KafkaOffsetRangeCalculator takes an optional minimum number of partitions per

executor ( minPartitions ) to be created (that can either be undefined or greater than 0 ).

When created with a DataSourceOptions , KafkaOffsetRangeCalculator uses minPartitions


option for the minimum number of partitions per executor.

Offset Ranges —  getRanges Method

getRanges(
fromOffsets: PartitionOffsetMap,
untilOffsets: PartitionOffsetMap,
executorLocations: Seq[String] = Seq.empty): Seq[KafkaOffsetRange]

getRanges finds the common TopicPartitions that are the keys that are used in the

fromOffsets and untilOffsets collections (intersection).

For every common TopicPartition , getRanges creates a KafkaOffsetRange with the from
and until offsets from the fromOffsets and untilOffsets collections (and the preferredLoc
undefined). getRanges filters out the TopicPartitions that have no records to consume (i.e.
the difference between until and from offsets is not greater than 0 ).

At this point, getRanges knows the TopicPartitions with records to consume.

getRanges branches off based on the defined minimum number of partitions per executor

and the number of KafkaOffsetRanges ( TopicPartitions with records to consume).

For the minimum number of partitions per executor undefined or smaller than the number of
KafkaOffsetRanges ( TopicPartitions to consume records from), getRanges updates every

KafkaOffsetRange with the preferred executor based on the TopicPartition and the

executorLocations ).

Otherwise (with the minimum number of partitions per executor defined and greater than the
number of KafkaOffsetRanges ), getRanges splits KafkaOffsetRanges into smaller ones.

Note getRanges is used exclusively when KafkaMicroBatchReader is requested to planInputPartitions.
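A standalone sketch of the splitting idea (splitRange is a hypothetical helper; the real calculator also carries the TopicPartition and preferred location along):

// Split a single (from, until) offset range into roughly equal, contiguous parts
def splitRange(from: Long, until: Long, parts: Int): Seq[(Long, Long)] = {
  val size = until - from
  (0 until parts)
    .map { i =>
      val start = from + i * size / parts
      val end = from + (i + 1) * size / parts
      (start, end)
    }
    .filter { case (start, end) => end > start } // drop empty sub-ranges
}

// splitRange(0, 10, 3) => Vector((0,3), (3,6), (6,10))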


KafkaOffsetRange — TopicPartition with From and Until Offsets and Optional Preferred Location
KafkaOffsetRange is a case class with the following attributes:

TopicPartition

fromOffset offset

untilOffset offset

Optional preferred location

KafkaOffsetRange knows the size, i.e. the number of records between the untilOffset and

fromOffset offsets.

Selecting Preferred Executor for TopicPartition —  getLocation Internal Method

getLocation(
tp: TopicPartition,
executorLocations: Seq[String]): Option[String]

getLocation …​FIXME

Note getLocation is used exclusively when KafkaOffsetRangeCalculator is requested to calculate offset ranges.


KafkaMicroBatchInputPartition
KafkaMicroBatchInputPartition is an InputPartition (of InternalRows ) that is used

(created) exclusively when KafkaMicroBatchReader is requested for input partitions (when


DataSourceV2ScanExec physical operator is requested for the partitions of the input RDD).

KafkaMicroBatchInputPartition takes the following to be created:

KafkaOffsetRange

Kafka parameters used for Kafka clients on executors ( Map[String, Object] )

Poll timeout (in ms)

failOnDataLoss flag

reuseKafkaConsumer flag

KafkaMicroBatchInputPartition creates a KafkaMicroBatchInputPartitionReader when

requested for a InputPartitionReader[InternalRow] (as a part of the InputPartition


contract).

KafkaMicroBatchInputPartition simply requests the given KafkaOffsetRange for the optional

preferredLoc when requested for preferredLocations (as a part of the InputPartition

contract).


KafkaMicroBatchInputPartitionReader
KafkaMicroBatchInputPartitionReader is an InputPartitionReader (of InternalRows ) that is

created exclusively when KafkaMicroBatchInputPartition is requested for one (as a part of


the InputPartition contract).

Creating KafkaMicroBatchInputPartitionReader Instance


KafkaMicroBatchInputPartitionReader takes the following to be created:

KafkaOffsetRange

Kafka parameters used for Kafka clients on executors ( Map[String, Object] )

Poll timeout (in ms)

failOnDataLoss flag

reuseKafkaConsumer flag

Note All the input arguments to create a KafkaMicroBatchInputPartitionReader are exactly the input arguments used to create a KafkaMicroBatchInputPartition.

KafkaMicroBatchInputPartitionReader initializes the internal properties.

next Method

next(): Boolean

Note next is part of the InputPartitionReader contract to proceed to the next record if available ( true ).

next checks whether the KafkaDataConsumer should poll records or not (i.e. nextOffset is

smaller than the untilOffset of the KafkaOffsetRange).

next Method — KafkaDataConsumer Polls Records

If so, next requests the KafkaDataConsumer to get (poll) records in the range of nextOffset
and the untilOffset (of the KafkaOffsetRange) with the given pollTimeoutMs and
failOnDataLoss.


With a new record, next requests the KafkaRecordToUnsafeRowConverter to convert


( toUnsafeRow ) the record to be the next UnsafeRow. next sets the nextOffset as the offset
of the record incremented. next returns true .

With no new record, next simply returns false .

next Method — No Polling

If the nextOffset is equal or larger than the untilOffset (of the KafkaOffsetRange), next
simply returns false .

Closing (Releasing KafkaDataConsumer) —  close Method

close(): Unit

Note close is part of the Java Closeable contract to release resources.

close simply requests the KafkaDataConsumer to release .

resolveRange Internal Method

resolveRange(
range: KafkaOffsetRange): KafkaOffsetRange

resolveRange …​FIXME

Note resolveRange is used exclusively when KafkaMicroBatchInputPartitionReader is created (and initializes the KafkaOffsetRange internal property).

Internal Properties


consumer: KafkaDataConsumer for the partition (per KafkaOffsetRange). Used in next, close, and resolveRange

converter: KafkaRecordToUnsafeRowConverter

nextOffset: Next offset

nextRow: Next UnsafeRow

rangeToRead: KafkaOffsetRange


KafkaSourceInitialOffsetWriter
KafkaSourceInitialOffsetWriter is a Hadoop DFS-based metadata storage for

KafkaSourceOffsets.

KafkaSourceInitialOffsetWriter is created exclusively when KafkaMicroBatchReader is

requested to getOrCreateInitialPartitionOffsets.

KafkaSourceInitialOffsetWriter uses 1 for the version.

Creating KafkaSourceInitialOffsetWriter Instance


KafkaSourceInitialOffsetWriter takes the following to be created:

SparkSession

Path of the metadata log directory

Deserializing Metadata (Reading Metadata from Serialized Format) —  deserialize Method

deserialize(
in: InputStream): KafkaSourceOffset

Note deserialize is part of the HDFSMetadataLog Contract to deserialize metadata (reading metadata from a serialized format).

deserialize …​FIXME


KafkaContinuousReader — ContinuousReader for Kafka Data Source in Continuous Stream Processing
KafkaContinuousReader is a ContinuousReader for Kafka Data Source in Continuous Stream

Processing.

KafkaContinuousReader is created exclusively when KafkaSourceProvider is requested to

create a ContinuousReader.

KafkaContinuousReader uses kafkaConsumer.pollTimeoutMs configuration parameter

(default: 512 ) for KafkaContinuousInputPartitions when requested to planInputPartitions.

Tip Enable INFO or WARN logging levels for org.apache.spark.sql.kafka010.KafkaContinuousReader to see what happens inside.

Add the following line to conf/log4j.properties :

log4j.logger.org.apache.spark.sql.kafka010.KafkaContinuousReader=INFO

Refer to Logging.

Creating KafkaContinuousReader Instance


KafkaContinuousReader takes the following to be created:

KafkaOffsetReader

Kafka parameters (as java.util.Map[String, Object] )

Source options (as Map[String, String] )

Metadata path

Initial offsets

failOnDataLoss flag

Plan Input Partitions —  planInputPartitions Method

planInputPartitions(): java.util.List[InputPartition[InternalRow]]


Note planInputPartitions is part of the DataSourceReader contract in Spark SQL for the number of InputPartitions to use as RDD partitions (when DataSourceV2ScanExec physical operator is requested for the partitions of the input RDD).

planInputPartitions …​FIXME

setStartOffset Method

setStartOffset(
start: Optional[Offset]): Unit

Note setStartOffset is part of the ContinuousReader Contract to…​FIXME.

setStartOffset …​FIXME

deserializeOffset Method

deserializeOffset(
json: String): Offset

Note deserializeOffset is part of the ContinuousReader Contract to…​FIXME.

deserializeOffset …​FIXME

mergeOffsets Method

mergeOffsets(
offsets: Array[PartitionOffset]): Offset

Note mergeOffsets is part of the ContinuousReader Contract to…​FIXME.

mergeOffsets …​FIXME


KafkaContinuousInputPartition
KafkaContinuousInputPartition is…​FIXME


TextSocketSourceProvider
TextSocketSourceProvider is a StreamSourceProvider for TextSocketSource that reads records from host and port .

TextSocketSourceProvider is a DataSourceRegister, too.

The short name of the data source is socket .

It requires two mandatory options (that you can set using option method):

1. host which is the host name.

2. port which is the port number. It must be an integer.

TextSocketSourceProvider also supports includeTimestamp option that is a boolean flag that

you can use to include timestamps in the schema.
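For example (a sketch; host and port are placeholders):

val socketWithTime = spark
  .readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .option("includeTimestamp", true)
  .load
// root
//  |-- value: string (nullable = true)
//  |-- timestamp: timestamp (nullable = true)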

includeTimestamp Option

Caution FIXME

createSource
createSource grabs the two mandatory options —  host and port  — and returns a TextSocketSource.

sourceSchema
sourceSchema returns textSocket as the name of the source and the schema that can be

one of the two available schemas:

1. SCHEMA_REGULAR (default) which is a schema with a single value field of String type.

2. SCHEMA_TIMESTAMP when the includeTimestamp flag option is set. It is not, i.e. false , by default. The schema has a value field of StringType type and a timestamp field of TimestampType type of format yyyy-MM-dd HH:mm:ss .

Tip Read about schema.

Internally, it starts by printing out the following WARN message to the logs:


WARN TextSocketSourceProvider: The socket source should not be used for production app
lications! It does not support recovery and stores state indefinitely.

It then checks whether the host and port parameters are defined and, if not, throws an AnalysisException :

Set a host to read from with option("host", ...).


TextSocketSource
TextSocketSource is a streaming source that reads lines from a socket at the host and

port (defined by parameters).

It uses lines internal in-memory buffer to keep all of the lines that were read from a socket
forever.

Caution This source is not for production use due to design constraints, e.g. infinite in-memory collection of lines read and no fault recovery. It is designed only for tutorials and debugging.


import org.apache.spark.sql.SparkSession
val spark: SparkSession = SparkSession.builder.getOrCreate()

// Connect to localhost:9999
// You can use "nc -lk 9999" for demos
val textSocket = spark.
readStream.
format("socket").
option("host", "localhost").
option("port", 9999).
load

import org.apache.spark.sql.Dataset
val lines: Dataset[String] = textSocket.as[String].map(_.toUpperCase)

val query = lines.writeStream.format("console").start

// Start typing the lines in nc session


// They will appear UPPERCASE in the terminal

-------------------------------------------
Batch: 0
-------------------------------------------
+---------+
| value|
+---------+
|UPPERCASE|
+---------+

scala> query.explain
== Physical Plan ==
*SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, Str
ingType, fromString, input[0, java.lang.String, true], true) AS value#21]
+- *MapElements <function1>, obj#20: java.lang.String
+- *DeserializeToObject value#43.toString, obj#19: java.lang.String
+- LocalTableScan [value#43]

scala> query.stop

lines Internal Buffer

lines: ArrayBuffer[(String, Timestamp)]

lines is the internal buffer of all the lines TextSocketSource read from the socket.

Maximum Available Offset (getOffset method)


Note getOffset is a part of the Streaming Source Contract.

TextSocketSource 's offset can either be none or LongOffset of the number of lines in the

internal lines buffer.

Schema (schema method)


TextSocketSource supports two schemas:

1. A single value field of String type.

2. value field of StringType type and timestamp field of TimestampType type of format

yyyy-MM-dd HH:mm:ss .

Tip Refer to sourceSchema for TextSocketSourceProvider .

Creating TextSocketSource Instance

TextSocketSource(
host: String,
port: Int,
includeTimestamp: Boolean,
sqlContext: SQLContext)

When TextSocketSource is created (see TextSocketSourceProvider), it gets 4 parameters


passed in:

1. host

2. port

3. includeTimestamp flag

4. SQLContext

Caution It appears that the source did not get "renewed" to use SparkSession instead.

It opens a socket at the given host and port parameters and reads a buffered character-input stream using the default charset and the default-sized input buffer (of 8192 bytes) line by line.

Caution FIXME Review Java’s Charset.defaultCharset()


It starts a readThread daemon thread (called TextSocketSource(host, port) ) to read lines


from the socket. The lines are added to the internal lines buffer.

Stopping TextSocketSource (stop method)


When stopped, TextSocketSource closes the socket connection.


RateSourceProvider
RateSourceProvider is a StreamSourceProvider for RateStreamSource (that acts as the

source for rate format).

Note RateSourceProvider is also a DataSourceRegister .

The short name of the data source is rate.


RateStreamSource
RateStreamSource is a streaming source that generates consecutive numbers with

timestamp that can be useful for testing and PoCs.

RateStreamSource is created for rate format (that is registered by RateSourceProvider).

val rates = spark


.readStream
.format("rate") // <-- use RateStreamSource
.option("rowsPerSecond", 1)
.load

Table 1. RateStreamSource’s Options

numPartitions (default: default parallelism): Number of partitions to use

rampUpTime (default: 0 seconds)

rowsPerSecond (default: 1 ): Number of rows to generate per second (has to be greater than 0 )
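For example (a sketch):

val rates = spark
  .readStream
  .format("rate")
  .option("rowsPerSecond", 10) // 10 rows every second
  .option("rampUpTime", "5s")  // ramp up to the full rate over 5 seconds
  .option("numPartitions", 2)  // spread the rows across 2 partitions
  .load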

RateStreamSource uses a predefined schema that cannot be changed.

val schema = rates.schema


scala> println(schema.treeString)
root
|-- timestamp: timestamp (nullable = true)
|-- value: long (nullable = true)

Table 2. RateStreamSource’s Dataset Schema (in the positional order)

timestamp: TimestampType

value: LongType


Table 3. RateStreamSource’s Internal Registries and Counters


Name Description
clock

lastTimeMs

maxSeconds

startTimeMs

Tip Enable INFO or DEBUG logging levels for org.apache.spark.sql.execution.streaming.RateStreamSource to see what happens inside.

Add the following line to conf/log4j.properties :

log4j.logger.org.apache.spark.sql.execution.streaming.RateStreamSource=DEBUG

Refer to Logging.

Getting Maximum Available Offsets —  getOffset Method

getOffset: Option[Offset]

Note getOffset is a part of the Source Contract.

Caution FIXME

Generating DataFrame for Streaming Batch —  getBatch Method

getBatch(start: Option[Offset], end: Offset): DataFrame

Note getBatch is a part of Source Contract.

Internally, getBatch calculates the seconds to start from and end at (from the input start
and end offsets) or assumes 0 .

getBatch then calculates the values to generate for the start and end seconds.

You should see the following DEBUG message in the logs:


DEBUG RateStreamSource: startSeconds: [startSeconds], endSeconds: [endSeconds], rangeStart: [rangeStart], rangeEnd: [rangeEnd]

If the start and end ranges are equal, getBatch creates an empty DataFrame (with the
schema) and returns.

Otherwise, when the ranges are different, getBatch creates a DataFrame using
SparkContext.range operator (for the start and end ranges and numPartitions partitions).

Creating RateStreamSource Instance


RateStreamSource takes the following when created:

SQLContext

Path to the metadata

Rows per second

RampUp time in seconds

Number of partitions

Flag to whether to use ManualClock ( true ) or SystemClock ( false )

RateStreamSource initializes the internal registries and counters.


RateStreamMicroBatchReader
RateStreamMicroBatchReader is…​FIXME


ConsoleSinkProvider
ConsoleSinkProvider is a DataSourceV2 with StreamWriteSupport for console data source

format.

Tip Read up on DataSourceV2 Contract in The Internals of Spark SQL book.

ConsoleSinkProvider is a DataSourceRegister and registers itself as the console data

source format.

import org.apache.spark.sql.streaming.Trigger
val q = spark
.readStream
.format("rate")
.load
.writeStream
.format("console") // <-- requests ConsoleSinkProvider for a sink
.trigger(Trigger.Once)
.start
scala> println(q.lastProgress.sink)
{
"description" : "org.apache.spark.sql.execution.streaming.ConsoleSinkProvider@2392cf
b1"
}

When requested for a StreamWriter, ConsoleSinkProvider simply creates a ConsoleWriter


(with the given schema and options).

ConsoleSinkProvider is a CreatableRelationProvider.

Tip Read up on CreatableRelationProvider in The Internals of Spark SQL book.

createRelation Method

createRelation(
sqlContext: SQLContext,
mode: SaveMode,
parameters: Map[String, String],
data: DataFrame): BaseRelation

Note createRelation is part of the CreatableRelationProvider Contract to support writing a structured query (a DataFrame) per save mode.

createRelation …​FIXME


ConsoleWriter
ConsoleWriter is a StreamWriter for console data source format.


ForeachWriterProvider
ForeachWriterProvider is…​FIXME


ForeachWriter
ForeachWriter is the contract for a foreach writer, a streaming format that controls how streaming writes are performed.

Note ForeachWriter is set using foreach operator.

val foreachWriter = new ForeachWriter[String] { ... }


streamingQuery.
writeStream.
foreach(foreachWriter).
start

ForeachWriter Contract

package org.apache.spark.sql

abstract class ForeachWriter[T] {


def open(partitionId: Long, version: Long): Boolean
def process(value: T): Unit
def close(errorOrNull: Throwable): Unit
}

Table 1. ForeachWriter Contract


Method Description
open Used when…​

process Used when…​

close Used when…​
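A minimal sketch of a ForeachWriter implementation (it simply prints every record to standard output; connection handling is left out on purpose):

import org.apache.spark.sql.ForeachWriter

val printlnWriter = new ForeachWriter[String] {
  // open a "connection" for the given partition and epoch (version);
  // returning true means the partition should be processed
  override def open(partitionId: Long, version: Long): Boolean = true

  // called once per record of the partition
  override def process(value: String): Unit = println(value)

  // called at the end (errorOrNull is null when processing succeeded)
  override def close(errorOrNull: Throwable): Unit = {}
}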


ForeachSink
ForeachSink is a typed streaming sink that passes rows (of the type T ) to ForeachWriter

(one record at a time per partition).

Note ForeachSink is assigned a ForeachWriter when DataStreamWriter is started.

ForeachSink is used exclusively in foreach operator.

val records = spark.
  readStream.
  format("text").
  load("server-logs/*.out").
  as[String]

import org.apache.spark.sql.ForeachWriter
val writer = new ForeachWriter[String] {
override def open(partitionId: Long, version: Long) = true
override def process(value: String) = println(value)
override def close(errorOrNull: Throwable) = {}
}

records.writeStream
.queryName("server-logs processor")
.foreach(writer)
.start

Internally, addBatch (the only method from the Sink Contract) takes records from the input
DataFrame (as data ), transforms them to expected type T (of this ForeachSink ) and
(now as a Dataset) processes each partition.

addBatch(batchId: Long, data: DataFrame): Unit

addBatch then opens the constructor’s ForeachWriter (for the current partition and the input

batch) and passes the records to process (one at a time per partition).

Caution FIXME Why does Spark track whether the writer failed or not? Why couldn't it finally and do close ?

Caution FIXME Can we have a constant for "foreach" for source in DataStreamWriter ?


ForeachBatchSink
ForeachBatchSink is a streaming sink that is used for the DataStreamWriter.foreachBatch

streaming operator.

ForeachBatchSink is created exclusively when DataStreamWriter is requested to start

execution of the streaming query (with the foreachBatch source).

ForeachBatchSink uses ForeachBatchSink as the description (name) of the sink.

import org.apache.spark.sql.Dataset
val q = spark.readStream
.format("rate")
.load
.writeStream
  .foreachBatch { (output: Dataset[_], batchId: Long) => // <-- creates a ForeachBatchSink
println(s"Batch ID: $batchId")
output.show
}
.start
// q.stop

scala> println(q.lastProgress.sink.description)
ForeachBatchSink

Note ForeachBatchSink was added in Spark 2.4.0 as part of SPARK-24565 Add API for in Structured Streaming for exposing output rows of each microbatch as a DataFrame.

Creating ForeachBatchSink Instance


ForeachBatchSink takes the following when created:

Batch writer ( (Dataset[T], Long) ⇒ Unit )

Encoder ( ExpressionEncoder[T] )

Adding Batch —  addBatch Method

addBatch(batchId: Long, data: DataFrame): Unit

Note addBatch is a part of Sink Contract to "add" a batch of data to the sink.


addBatch …​FIXME


Memory Data Source


Memory Data Source is made up of the following two base implementations to support the
older DataSource API V1 and the modern DataSource API V2:

MemoryStreamBase

MemorySinkBase

Memory data source supports Micro-Batch and Continuous stream processing modes.

Stream Processing Source Sink


Micro-Batch MemoryStream MemorySink

Continuous ContinuousMemoryStream MemorySinkV2

Caution Memory Data Source is not for production use due to design constraints, e.g. an infinite in-memory collection of lines read and no fault recovery. MemoryStream is designed primarily for unit tests, tutorials and debugging.

Memory Sink
Memory sink requires that a streaming query has a name (defined using
DataStreamWriter.queryName or queryName option).

Memory sink may optionally define a checkpoint location using the checkpointLocation option that is used for recovery (with Complete output mode only).

Memory Sink and CreateViewCommand


When a streaming query with memory sink is started, DataStreamWriter uses
Dataset.createOrReplaceTempView operator to create or replace a local temporary view with

the name of the query (which is required).


Figure 1. Memory Sink and CreateViewCommand

Examples
Memory Source in Micro-Batch Stream Processing


val spark: SparkSession = ???

implicit val ctx = spark.sqlContext

import org.apache.spark.sql.execution.streaming.MemoryStream
// It uses two implicits: Encoder[Int] and SQLContext
val intsIn = MemoryStream[Int]

val ints = intsIn.toDF


.withColumn("t", current_timestamp())
.withWatermark("t", "5 minutes")
.groupBy(window($"t", "5 minutes") as "window")
.agg(count("*") as "total")

import org.apache.spark.sql.streaming.{OutputMode, Trigger}


import scala.concurrent.duration._
val totalsOver5mins = ints.
writeStream.
format("memory").
queryName("totalsOver5mins").
outputMode(OutputMode.Append).
trigger(Trigger.ProcessingTime(10.seconds)).
start

val zeroOffset = intsIn.addData(0, 1, 2)


totalsOver5mins.processAllAvailable()
spark.table("totalsOver5mins").show

totalsOver5mins.stop()

Memory Sink in Micro-Batch Stream Processing


val queryName = "memoryDemo"


val sq = spark
.readStream
.format("rate")
.load
.writeStream
.format("memory")
.queryName(queryName)
.start

// The name of the streaming query is an in-memory table


val showAll = sql(s"select * from $queryName")
scala> showAll.show(truncate = false)
+-----------------------+-----+
|timestamp |value|
+-----------------------+-----+
|2019-10-10 15:19:16.431|42 |
|2019-10-10 15:19:17.431|43 |
+-----------------------+-----+

import org.apache.spark.sql.streaming.StreamingQuery
assert(sq.isInstanceOf[StreamingQuery])

import org.apache.spark.sql.execution.streaming.StreamingQueryWrapper
val se = sq.asInstanceOf[StreamingQueryWrapper].streamingQuery

import org.apache.spark.sql.execution.streaming.MemorySink
val sink = se.sink.asInstanceOf[MemorySink]

assert(sink.toString == "MemorySink")

sink.clear()


MemoryStream — Streaming Reader for Micro-


Batch Stream Processing
MemoryStream is a concrete streaming source of memory data source that supports reading

in Micro-Batch Stream Processing.

Tip Enable ALL logging level for
org.apache.spark.sql.execution.streaming.MemoryStream logger to see what
happens inside.

Add the following line to conf/log4j.properties :

log4j.logger.org.apache.spark.sql.execution.streaming.MemoryStream=ALL

Refer to Logging.

Creating MemoryStream Instance


MemoryStream takes the following to be created:

ID

SQLContext

MemoryStream initializes the internal properties.

Creating MemoryStream Instance —  apply Object


Factory

apply[A : Encoder](
implicit sqlContext: SQLContext): MemoryStream[A]

apply uses a memoryStreamId internal counter to create a new MemoryStream with a unique ID and the implicit SQLContext .

Adding Data to Source —  addData Method

addData(
data: TraversableOnce[A]): Offset


addData adds the given data to the batches internal registry.

Internally, addData prints out the following DEBUG message to the logs:

Adding: [data]

In the end, addData increments the current offset and adds the data to the batches internal
registry.
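A minimal sketch of adding data to a MemoryStream in spark-shell (the returned Offset is the position of the newly-added batch):

import org.apache.spark.sql.execution.streaming.MemoryStream
implicit val sqlCtx = spark.sqlContext

val ints = MemoryStream[Int]
// addData returns the Offset of the newly-added batch
val offset = ints.addData(1, 2, 3)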

Generating Next Streaming Batch —  getBatch Method

Note getBatch is a part of Streaming Source contract.

When executed, getBatch uses the internal batches collection to return requested offsets.

You should see the following DEBUG message in the logs:

DEBUG MemoryStream: MemoryBatch [[startOrdinal], [endOrdinal]]: [newBlocks]

Logical Plan —  logicalPlan Internal Property

logicalPlan: LogicalPlan

Note logicalPlan is part of the MemoryStreamBase Contract for the logical query plan of the memory stream.

logicalPlan is simply a StreamingExecutionRelation (for this memory source and the

attributes).

MemoryStream uses StreamingExecutionRelation logical plan to build Datasets or

DataFrames when requested.

scala> val ints = MemoryStream[Int]


ints: org.apache.spark.sql.execution.streaming.MemoryStream[Int] = MemoryStream[value#
13]

scala> ints.toDS.queryExecution.logical.isStreaming
res14: Boolean = true

scala> ints.toDS.queryExecution.logical
res15: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan = MemoryStream[value#13]


Schema (schema method)


MemoryStream works with the data of the schema as described by the Encoder (of the

Dataset ).

Textual Representation —  toString Method

toString: String

Note toString is part of the java.lang.Object contract for the string representation of the object.

toString uses the output schema to return the following textual representation:

MemoryStream[[output]]

Plan Input Partitions —  planInputPartitions Method

planInputPartitions(): java.util.List[InputPartition[InternalRow]]

Note planInputPartitions is part of the DataSourceReader contract in Spark SQL for the number of InputPartitions to use as RDD partitions (when DataSourceV2ScanExec physical operator is requested for the partitions of the input RDD).

planInputPartitions …​FIXME

planInputPartitions prints out a DEBUG message to the logs with the

generateDebugString (with the batches after the last committed offset).

planInputPartitions …​FIXME

generateDebugString Internal Method

generateDebugString(
rows: Seq[UnsafeRow],
startOrdinal: Int,
endOrdinal: Int): String

generateDebugString resolves and binds the encoder for the data.

In the end, generateDebugString returns the following string:


MemoryBatch [[startOrdinal], [endOrdinal]]: [rows]

Note generateDebugString is used exclusively when MemoryStream is requested to planInputPartitions.

Internal Properties

Name                  Description
batches               Batch data ( ListBuffer[Array[UnsafeRow]] )
currentOffset         Current offset (as LongOffset)
lastOffsetCommitted   Last committed offset (as LongOffset)
output                Output schema ( Seq[Attribute] ) of the logical query plan. Used exclusively for toString


ContinuousMemoryStream
ContinuousMemoryStream is…​FIXME


MemorySink
MemorySink is a streaming sink that stores batches (records) in memory.

MemorySink is intended only for testing or demos.

MemorySink is used for memory format and requires a query name (by queryName method

or queryName option).

Note MemorySink was introduced in the pull request for [SPARK-14288][SQL] Memory Sink for streaming.

Use toDebugString to see the batches.

Its aim is to allow users to test streaming applications in the Spark shell or other local tests.

You can set checkpointLocation using option method or it will be set to


spark.sql.streaming.checkpointLocation property.

If spark.sql.streaming.checkpointLocation is set, the code uses $location/$queryName


directory.

Finally, when no spark.sql.streaming.checkpointLocation is set, a temporary directory


memory.stream under java.io.tmpdir is used with offsets subdirectory inside.

Note The directory is cleaned up at shutdown using ShutdownHookManager.registerShutdownDeleteDir .

It creates MemorySink instance based on the schema of the DataFrame it operates on.

It creates a new DataFrame using MemoryPlan with MemorySink instance created earlier
and registers it as a temporary table (using DataFrame.registerTempTable method).

Note At this point you can query the table as if it were a regular non-streaming table using sql method.

A new StreamingQuery is started (using StreamingQueryManager.startQuery) and returned.


Tip Enable ALL logging level for
org.apache.spark.sql.execution.streaming.MemorySink logger to see what
happens inside.

Add the following line to conf/log4j.properties :

log4j.logger.org.apache.spark.sql.execution.streaming.MemorySink=ALL

Refer to Logging.

Creating MemorySink Instance


MemorySink takes the following to be created:

Output schema

OutputMode

MemorySink initializes the batches internal property.

In-Memory Buffer of Streaming Batches —  batches


Internal Property

batches: ArrayBuffer[AddedData]

batches holds data from streaming batches that have been added (written) to this sink.

For Append and Update output modes, batches holds rows from all batches.

For Complete output mode, batches holds rows from the last batch only.

batches can be cleared (emptied) using clear.

Adding Batch of Data to Sink —  addBatch Method

addBatch(
batchId: Long,
data: DataFrame): Unit

Note addBatch is part of the Sink Contract to "add" a batch of data to the sink.

addBatch branches off based on whether the given batchId has already been committed

or not.


A batch ID is considered already committed when the given batch ID is not greater than the latest batch ID (if available).

Batch Not Committed


With the batchId not committed, addBatch prints out the following DEBUG message to the
logs:

Committing batch [batchId] to [this]

addBatch collects records from the given data .

Note addBatch uses Dataset.collect operator to collect records.

For Append and Update output modes, addBatch adds the data (as a AddedData ) to the
batches internal registry.

For Complete output mode, addBatch clears the batches internal registry first before adding
the data (as a AddedData ).

For any other output mode, addBatch reports an IllegalArgumentException :

Output mode [outputMode] is not supported by MemorySink
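The branching can be summarized with the following sketch (a simplification, not the actual MemorySink code; AddedData and batches stand in for the internals described above):

import scala.collection.mutable.ArrayBuffer
import org.apache.spark.sql.Row
import org.apache.spark.sql.streaming.OutputMode

// hypothetical stand-ins for MemorySink internals
case class AddedData(batchId: Long, data: Seq[Row])
val batches = ArrayBuffer.empty[AddedData]

def addBatchSketch(batchId: Long, rows: Seq[Row], outputMode: OutputMode): Unit = {
  if (outputMode == OutputMode.Append() || outputMode == OutputMode.Update()) {
    // Append and Update: keep rows from all batches
    batches += AddedData(batchId, rows)
  } else if (outputMode == OutputMode.Complete()) {
    // Complete: keep rows from the last batch only
    batches.clear()
    batches += AddedData(batchId, rows)
  } else {
    throw new IllegalArgumentException(
      s"Output mode $outputMode is not supported by MemorySink")
  }
}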

Batch Committed
With the batchId committed, addBatch simply prints out the following DEBUG message to
the logs and returns.

Skipping already committed batch: [batchId]

Clearing Up Internal Batch Buffer —  clear Method

clear(): Unit

clear simply removes (clears) all data from the batches internal registry.

Note clear is used exclusively in tests.


MemorySinkV2 — Writable Streaming Sink for


Continuous Stream Processing
MemorySinkV2 is a DataSourceV2 with StreamWriteSupport for memory data source format

in Continuous Stream Processing.

Tip Read up on DataSourceV2 Contract in The Internals of Spark SQL book.

MemorySinkV2 is a custom MemorySinkBase.

When requested for a StreamWriter, MemorySinkV2 simply creates a MemoryStreamWriter.


MemoryStreamWriter
MemoryStreamWriter is…​FIXME


MemoryStreamBase Contract — Base Contract


for Memory Sources
MemoryStreamBase is the base of the BaseStreamingSource contract for memory sources

that can add data.

Table 1. MemoryStreamBase Contract

Method        Description

addData       addData(data: TraversableOnce[A]): Offset

logicalPlan   logicalPlan: LogicalPlan

Table 2. MemoryStreamBases
MemoryStreamBase Description

ContinuousMemoryStream

MemoryStream MicroBatchReader for Micro-Batch Stream Processing

Creating MemoryStreamBase Instance


MemoryStreamBase takes the following to be created:

SQLContext

Note MemoryStreamBase is a Scala abstract class and cannot be created directly. It is created indirectly for the concrete MemoryStreamBases.

Creating Streaming Dataset —  toDS Method

toDS(): Dataset[A]

toDS simply creates a Dataset (for the sqlContext and the logicalPlan)

Creating Streaming DataFrame —  toDF Method


toDF(): DataFrame

toDF simply creates a Dataset of rows (for the sqlContext and the logicalPlan)

Internal Properties

Name         Description
attributes   Schema attributes of the encoder ( Seq[AttributeReference] ). Used when…​FIXME
encoder      Spark SQL’s ExpressionEncoder for the data. Used when…​FIXME


MemorySinkBase Contract — Base Contract for


Memory Sinks
MemorySinkBase is the extension of the BaseStreamingSink contract for memory sinks that

manage all data in memory.

Table 1. MemorySinkBase Contract

Method            Description

allData           allData: Seq[Row]

dataSinceBatch    dataSinceBatch(sinceBatchId: Long): Seq[Row]

latestBatchData   latestBatchData: Seq[Row]

latestBatchId     latestBatchId: Option[Long]

Table 2. MemorySinkBases

MemorySinkBase   Description
MemorySink       Streaming sink for Micro-Batch Stream Processing (based on Data Source API V1)
MemorySinkV2     Writable streaming sink for Continuous Stream Processing (based on Data Source API V2)
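As a sketch, the contract methods can be exercised against a running memory-sink query (building on the earlier Memory Sink example, where sink is the MemorySink instance extracted from the query):

// sink is a MemorySink (a MemorySinkBase) of a started memory-sink query
val allRows = sink.allData          // rows from all (or the last) batches
val lastRows = sink.latestBatchData // rows of the latest batch only
val lastBatch = sink.latestBatchId  // Some(batchId) of the latest batch, if any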


Offsets and Metadata Checkpointing


A streaming query can be started from scratch or from checkpoint (that gives fault-tolerance
as the state is preserved even when a failure happens).

Stream execution engines use checkpoint location to resume stream processing and get
start offsets to start query processing from.

StreamExecution resumes (populates the start offsets) from the latest checkpointed offsets

from the Write-Ahead Log (WAL) of Offsets that may have already been processed (and, if
so, committed to the Offset Commit Log).

Hadoop DFS-based metadata storage of OffsetSeqs

OffsetSeq and StreamProgress

StreamProgress and StreamExecutions (committed and available offsets)

Micro-Batch Stream Processing


In Micro-Batch Stream Processing, the available offsets registry is populated with the latest
offsets from the Write-Ahead Log (WAL) when MicroBatchExecution stream processing
engine is requested to populate start offsets from checkpoint (if available) when
MicroBatchExecution is requested to run an activated streaming query (before the first

"zero" micro-batch).

The available offsets are then added to the committed offsets when the latest batch ID
available (as described above) is exactly the latest batch ID committed to the Offset Commit
Log when MicroBatchExecution stream processing engine is requested to populate start
offsets from checkpoint.

When a streaming query is started from scratch (with no checkpoint that has offsets in the
Offset Write-Ahead Log), MicroBatchExecution prints out the following INFO message:

Starting new streaming query.

When a streaming query is resumed (restarted) from a checkpoint with offsets in the Offset
Write-Ahead Log, MicroBatchExecution prints out the following INFO message:

Resuming at batch [currentBatchId] with committed offsets [committedOffsets] and available offsets [availableOffsets]


Every time MicroBatchExecution is requested to check whether a new data is available (in
any of the streaming sources)…​FIXME

When MicroBatchExecution is requested to construct the next streaming micro-batch (when


MicroBatchExecution requested to run the activated streaming query), every streaming

source is requested for the latest offset available that are added to the availableOffsets
registry. Streaming sources report some offsets or none at all (if this source has never
received any data). Streaming sources with no data are excluded (filtered out).

MicroBatchExecution prints out the following TRACE message to the logs:

noDataBatchesEnabled = [noDataBatchesEnabled],
lastExecutionRequiresAnotherBatch =
[lastExecutionRequiresAnotherBatch], isNewDataAvailable =
[isNewDataAvailable], shouldConstructNextBatch =
[shouldConstructNextBatch]

With shouldConstructNextBatch internal flag enabled, MicroBatchExecution commits (adds)


the available offsets for the batch to the Write-Ahead Log (WAL) and prints out the following
INFO message to the logs:

Committed offsets for batch [currentBatchId]. Metadata [offsetSeqMetadata]

When running a single streaming micro-batch, MicroBatchExecution requests every Source


and MicroBatchReader (in the availableOffsets registry) for unprocessed data (that has not
been committed yet and so considered unprocessed).

In the end (of running a single streaming micro-batch), MicroBatchExecution commits (adds)
the available offsets (to the committedOffsets registry) so they are considered processed
already.

MicroBatchExecution prints out the following DEBUG message to the logs:

Completed batch [currentBatchId]

Limitations (Assumptions)
It is assumed that the order of streaming sources in a streaming query matches the order of
the offsets of OffsetSeq (in offsetLog) and availableOffsets.


In other words, a streaming query can be modified and then restarted from a checkpoint (to
maintain stream processing state) only when the number of streaming sources and their
order are preserved across restarts.


MetadataLog Contract — Metadata Storage


MetadataLog is the abstraction of metadata storage that can persist, retrieve, and remove

metadata (of type T ).

Table 1. MetadataLog Contract


Method      Description

add

  add(
    batchId: Long,
    metadata: T): Boolean

  Persists (adds) metadata of a streaming batch

  Used when:

  KafkaMicroBatchReader is requested to getOrCreateInitialPartitionOffsets
  KafkaSource is requested for the initialPartitionOffsets
  CompactibleFileStreamLog is requested to store metadata of a streaming batch and to compact
  FileStreamSource is requested to fetchMaxOffset
  FileStreamSourceLog is requested to store (add) metadata of a streaming batch
  ManifestFileCommitProtocol is requested to commitJob
  MicroBatchExecution stream execution engine is requested to construct a next streaming micro-batch and run a single streaming micro-batch
  ContinuousExecution stream execution engine is requested to addOffset and commit an epoch
  RateStreamMicroBatchReader is created ( creationTimeMs )

get

  get(
    batchId: Long): Option[T]
  get(
    startId: Option[Long],
    endId: Option[Long]): Array[(Long, T)]

  Retrieves (gets) metadata of one or more batches

  Used when…​FIXME

getLatest

  getLatest(): Option[(Long, T)]

  Retrieves the latest-committed metadata (if available)

  Used when…​FIXME

purge

  purge(thresholdBatchId: Long): Unit

  Used when…​FIXME

Note HDFSMetadataLog is the only direct implementation of the MetadataLog Contract in Spark Structured Streaming.


HDFSMetadataLog — Hadoop DFS-based
Metadata Storage
HDFSMetadataLog is a concrete metadata storage (of type T ) that uses Hadoop DFS for

fault-tolerance and reliability.

HDFSMetadataLog uses the given path as the metadata directory with metadata logs. The

path is immediately converted to a Hadoop Path for file management.

HDFSMetadataLog uses Json4s with the Jackson binding for metadata serialization and

deserialization (to and from JSON format).

HDFSMetadataLog is further customized by the extensions.

Table 1. HDFSMetadataLogs (Direct Extensions Only)


HDFSMetadataLog                  Description
Anonymous                        HDFSMetadataLog of KafkaSourceOffsets for KafkaSource
Anonymous                        HDFSMetadataLog of LongOffsets for RateStreamMicroBatchReader
CommitLog                        Offset commit log of streaming query execution engines
CompactibleFileStreamLog         Compactible metadata logs (that compact logs at regular interval)
KafkaSourceInitialOffsetWriter   HDFSMetadataLog of KafkaSourceOffsets for KafkaSource
OffsetSeqLog                     Write-Ahead Log (WAL) of stream execution engines

Creating HDFSMetadataLog Instance


HDFSMetadataLog takes the following to be created:

SparkSession

Path of the metadata log directory

While being created, HDFSMetadataLog creates the path unless it exists already.


Serializing Metadata (Writing Metadata in Serialized


Format) —  serialize Method

serialize(
metadata: T,
out: OutputStream): Unit

serialize simply writes the log data (serialized using Json4s (with Jackson binding)

library).

Note serialize is used exclusively when HDFSMetadataLog is requested to write metadata of a streaming batch to a file (metadata log) (when storing metadata of a streaming batch).

Deserializing Metadata (Reading Metadata from Serialized


Format) —  deserialize Method

deserialize(in: InputStream): T

deserialize deserializes a metadata (of type T ) from a given InputStream .

Note deserialize is used exclusively when HDFSMetadataLog is requested to retrieve metadata of a batch.

Retrieving Metadata Of Streaming Batch —  get Method

get(batchId: Long): Option[T]

Note get is part of the MetadataLog Contract to get metadata of a batch.

get …​FIXME

Retrieving Metadata of Range of Batches —  get Method

get(
startId: Option[Long],
endId: Option[Long]): Array[(Long, T)]

Note get is part of the MetadataLog Contract to get metadata of range of batches.


get …​FIXME

Persisting Metadata of Streaming Micro-Batch —  add


Method

add(
batchId: Long,
metadata: T): Boolean

Note add is part of the MetadataLog Contract to persist metadata of a streaming batch.

add returns true when the metadata of the streaming batch was not available and persisted successfully. Otherwise, add returns false .

Internally, add looks up metadata of the given streaming batch ( batchId ) and returns
false when found.

Otherwise, when not found, add creates a metadata log file for the given batchId and
writes metadata to the file. add returns true if successful.
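As a sketch, HDFSMetadataLog can be exercised directly (here with String metadata and an arbitrary directory):

import org.apache.spark.sql.execution.streaming.HDFSMetadataLog

val metadataLog = new HDFSMetadataLog[String](spark, "/tmp/metadata-log-demo")

assert(metadataLog.add(0, "zero batch"))    // persisted, so true
assert(!metadataLog.add(0, "zero batch"))   // batch 0 already exists, so false

assert(metadataLog.get(0) == Some("zero batch"))
assert(metadataLog.getLatest() == Some((0L, "zero batch")))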

Latest Committed Batch Id with Metadata (When Available) 


—  getLatest Method

getLatest(): Option[(Long, T)]

Note getLatest is a part of MetadataLog Contract to retrieve the recently-committed batch id and the corresponding metadata if available in the metadata storage.

getLatest requests the internal FileManager for the files in metadata directory that match

batch file filter.

getLatest takes the batch ids (the batch files correspond to) and sorts the ids in reverse

order.

getLatest gives the first batch id with the metadata which could be found in the metadata

storage.

Note It is possible that the batch id could be in the metadata storage, but not available for retrieval.

Removing Expired Metadata (Purging) —  purge Method


purge(thresholdBatchId: Long): Unit

Note purge is part of the MetadataLog Contract to…​FIXME.

purge …​FIXME

Creating Batch Metadata File —  batchIdToPath Method

batchIdToPath(batchId: Long): Path

batchIdToPath simply creates a Hadoop Path for the file named after the specified batchId under the metadata directory.

Note batchIdToPath is used when:

CompactibleFileStreamLog is requested to compact and allFiles

HDFSMetadataLog is requested to add, get, purge, and purgeAfter

isBatchFile Method

isBatchFile(path: Path): Boolean

isBatchFile …​FIXME

Note isBatchFile is used exclusively when HDFSMetadataLog is requested for the PathFilter of batch files.

pathToBatchId Method

pathToBatchId(path: Path): Long

pathToBatchId …​FIXME

Note pathToBatchId is used when:

CompactibleFileStreamLog is requested for the compact interval

HDFSMetadataLog is requested to isBatchFile, get metadata of a range of batches, getLatest, getOrderedBatchFiles, purge, and purgeAfter


verifyBatchIds Object Method

verifyBatchIds(
batchIds: Seq[Long],
startId: Option[Long],
endId: Option[Long]): Unit

verifyBatchIds …​FIXME

Note verifyBatchIds is used when:

FileStreamSourceLog is requested to get

HDFSMetadataLog is requested to get

Retrieving Version (From Text Line) —  parseVersion


Internal Method

parseVersion(
text: String,
maxSupportedVersion: Int): Int

parseVersion …​FIXME

Note parseVersion is used when:

KafkaSourceInitialOffsetWriter is requested to deserialize metadata

KafkaSource is requested for the initial partition offsets

CommitLog is requested to deserialize metadata

CompactibleFileStreamLog is requested to deserialize metadata

OffsetSeqLog is requested to deserialize metadata

RateStreamMicroBatchReader is requested to deserialize metadata

purgeAfter Method

purgeAfter(thresholdBatchId: Long): Unit

purgeAfter …​FIXME

Note purgeAfter seems to be used exclusively in tests.


Writing Batch Metadata to File (Metadata Log) 


—  writeBatchToFile Internal Method

writeBatchToFile(
metadata: T,
path: Path): Unit

writeBatchToFile requests the CheckpointFileManager to createAtomic (for the specified

path and the overwriteIfPossible flag disabled).

writeBatchToFile then serializes the metadata (to the CancellableFSDataOutputStream

output stream) and closes the stream.

In case of an exception, writeBatchToFile simply requests the


CancellableFSDataOutputStream output stream to cancel (so that the output file is not

generated) and re-throws the exception.

Note writeBatchToFile is used exclusively when HDFSMetadataLog is requested to store (persist) metadata of a streaming batch.

Retrieving Ordered Batch Metadata Files 


—  getOrderedBatchFiles Method

getOrderedBatchFiles(): Array[FileStatus]

getOrderedBatchFiles …​FIXME

Note getOrderedBatchFiles does not seem to be used at all.

Internal Properties

Name               Description

batchFilesFilter   Hadoop’s PathFilter of batch files (with names being long numbers)

                   Used when:

                   CompactibleFileStreamLog is requested for the compactInterval
                   HDFSMetadataLog is requested to get batch metadata, getLatest, getOrderedBatchFiles, purge, and purgeAfter

fileManager        CheckpointFileManager

                   Used when…​FIXME


CommitLog — HDFSMetadataLog for Offset


Commit Log
CommitLog is an HDFSMetadataLog with CommitMetadata metadata.

CommitLog is created exclusively for the offset commit log of StreamExecution.

CommitLog uses CommitMetadata for the metadata with nextBatchWatermarkMs attribute

(of type Long and the default 0 ).

CommitLog writes commit metadata to files with names that are offsets.

$ ls -tr [checkpoint-directory]/commits
0 1 2 3 4 5 6 7 8 9

$ cat [checkpoint-directory]/commits/8
v1
{"nextBatchWatermarkMs": 0}

CommitLog uses 1 for the version.

CommitLog (like the parent HDFSMetadataLog) takes the following to be created:

SparkSession

Path of the metadata log directory

Serializing Metadata (Writing Metadata to Persistent


Storage) —  serialize Method

serialize(
metadata: CommitMetadata,
out: OutputStream): Unit

Note serialize is part of HDFSMetadataLog Contract to write a metadata in serialized format.

serialize writes out the version prefixed with v on a single line (e.g. v1 ) followed by the

given CommitMetadata in JSON format.

Deserializing Metadata —  deserialize Method


deserialize(in: InputStream): CommitMetadata

Note deserialize is part of HDFSMetadataLog Contract to deserialize a metadata (from an InputStream ).

deserialize simply reads (deserializes) two lines from the given InputStream for version

and the nextBatchWatermarkMs attribute.

add Method

add(batchId: Long): Unit

add …​FIXME

Note add is used when…​FIXME

add Method

add(batchId: Long, metadata: String): Boolean

Note add is part of MetadataLog Contract to…​FIXME.

add …​FIXME


CommitMetadata
CommitMetadata is…​FIXME


OffsetSeqLog — Hadoop DFS-based Metadata


Storage of OffsetSeqs
OffsetSeqLog is a Hadoop DFS-based metadata storage for OffsetSeq metadata.

OffsetSeqLog uses OffsetSeq for metadata which holds an ordered collection of offsets and

optional metadata (as OffsetSeqMetadata for event-time watermark).

OffsetSeqLog is created exclusively for the write-ahead log (WAL) of offsets of stream

execution engines (i.e. ContinuousExecution and MicroBatchExecution).

OffsetSeqLog uses 1 for the version when serializing and deserializing metadata.

Creating OffsetSeqLog Instance


OffsetSeqLog (like the parent HDFSMetadataLog) takes the following to be created:

SparkSession

Path of the metadata log directory

Serializing Metadata (Writing Metadata in Serialized


Format) —  serialize Method

serialize(
offsetSeq: OffsetSeq,
out: OutputStream): Unit

Note serialize is part of HDFSMetadataLog Contract to serialize metadata (write metadata in serialized format).

serialize firstly writes out the version prefixed with v on a single line (e.g. v1 ) followed

by the optional metadata in JSON format.

serialize then writes out the offsets in JSON format, one per line.

Note A streaming source with no offsets to write in offsetSeq is marked as - (a dash) in the log.


$ ls -tr [checkpoint-directory]/offsets
0 1 2 3 4 5 6

$ cat [checkpoint-directory]/offsets/6
v1
{"batchWatermarkMs":0,"batchTimestampMs":1502872590006,"conf":{"spark.sql.shuffle.part
itions":"200","spark.sql.streaming.stateStore.providerClass":"org.apache.spark.sql.exe
cution.streaming.state.HDFSBackedStateStoreProvider"}}
51

Deserializing Metadata (Reading OffsetSeq from Serialized


Format) —  deserialize Method

deserialize(in: InputStream): OffsetSeq

Note deserialize is part of HDFSMetadataLog Contract to deserialize metadata (read metadata from serialized format).

deserialize firstly parses the version on the first line.

deserialize reads the optional metadata (with an empty line for metadata not available).

deserialize creates a SerializedOffset for every line left.

In the end, deserialize creates an OffsetSeq for the optional metadata and the SerializedOffsets .

When there are no lines in the InputStream , deserialize throws an


IllegalStateException :

Incomplete log file


OffsetSeq
OffsetSeq is the metadata managed by Hadoop DFS-based metadata storage.

OffsetSeq is created (possibly using the fill factory methods) when:

OffsetSeqLog is requested to deserialize metadata (retrieve metadata from a persistent

storage)

StreamProgress is requested to convert itself to OffsetSeq (most importantly when

MicroBatchExecution stream execution engine is requested to construct the next

streaming micro-batch to commit available offsets for a batch to the write-ahead log)

ContinuousExecution stream execution engine is requested to get start offsets and

addOffset

Creating OffsetSeq Instance


OffsetSeq takes the following when created:

Collection of optional Offsets (with None for streaming sources with no new data
available)

Optional OffsetSeqMetadata (default: None )

Converting to StreamProgress —  toStreamProgress


Method

toStreamProgress(
sources: Seq[BaseStreamingSource]): StreamProgress

toStreamProgress creates a new StreamProgress and adds the streaming sources for which

there are new offsets available.

Offsets is a collection with holes (empty elements) for streaming sources with
Note
no new data available.

toStreamProgress throws an AssertionError if the number of the input sources does not

match the offsets:

There are [[offsets.size]] sources in the checkpoint offsets and now there are [[sources.size]] sources requested by the query. Cannot continue.


Note toStreamProgress is used when:

MicroBatchExecution is requested to populate start offsets (from offsets and commits checkpoints) and construct (or skip) the next streaming micro-batch

ContinuousExecution is requested for start offsets

Textual Representation —  toString Method

toString: String

Note toString is part of the java.lang.Object contract for the string representation of the object.

toString simply converts the Offsets to JSON (if an offset is available) or - (a dash if an

offset is not available for a streaming source at that position).

Creating OffsetSeq Instance —  fill Factory Methods

fill(
offsets: Offset*): OffsetSeq (1)
fill(
metadata: Option[String],
offsets: Offset*): OffsetSeq

1. Uses no metadata ( None )

fill simply creates an OffsetSeq for the given variable sequence of Offsets and the

optional OffsetSeqMetadata (in JSON format).

Note fill is used when:

OffsetSeqLog is requested to deserialize metadata

ContinuousExecution stream execution engine is requested to get start offsets and addOffset


CompactibleFileStreamLog Contract — 
Compactible Metadata Logs
CompactibleFileStreamLog is the extension of the HDFSMetadataLog contract for

compactible metadata logs that compactLogs every compact interval.

CompactibleFileStreamLog uses spark.sql.streaming.minBatchesToRetain configuration

property (default: 100 ) for deleteExpiredLog.

CompactibleFileStreamLog uses .compact suffix for batchIdToPath,

getBatchIdFromFileName, and the compactInterval.

Table 1. CompactibleFileStreamLog Contract (Abstract Methods Only)


Method Description

compactLogs(logs: Seq[T]): Seq[T]

compactLogs
Used when CompactibleFileStreamLog is requested to
compact and allFiles

defaultCompactInterval: Int

defaultCompactInterval
Default compaction interval
Used exclusively when CompactibleFileStreamLog is
requested for the compactInterval

fileCleanupDelayMs: Long

fileCleanupDelayMs
Used exclusively when CompactibleFileStreamLog is
requested to deleteExpiredLog

isDeletingExpiredLog: Boolean

isDeletingExpiredLog
Used exclusively when CompactibleFileStreamLog is
requested to store (add) metadata of a streaming batch


Table 2. CompactibleFileStreamLogs
CompactibleFileStreamLog   Description

FileStreamSinkLog

FileStreamSourceLog        CompactibleFileStreamLog (of FileEntry metadata) of FileStreamSource

Creating CompactibleFileStreamLog Instance


CompactibleFileStreamLog takes the following to be created:

Metadata version

SparkSession

Path of the metadata log directory

Note CompactibleFileStreamLog is a Scala abstract class and cannot be created directly. It is created indirectly for the concrete CompactibleFileStreamLogs.

batchIdToPath Method

batchIdToPath(batchId: Long): Path

Note batchIdToPath is part of the HDFSMetadataLog Contract to…​FIXME.

batchIdToPath …​FIXME

pathToBatchId Method

pathToBatchId(path: Path): Long

Note pathToBatchId is part of the HDFSMetadataLog Contract to…​FIXME.

pathToBatchId …​FIXME

isBatchFile Method

isBatchFile(path: Path): Boolean


Note isBatchFile is part of the HDFSMetadataLog Contract to…​FIXME.

isBatchFile …​FIXME

Serializing Metadata (Writing Metadata in Serialized


Format) —  serialize Method

serialize(
logData: Array[T],
out: OutputStream): Unit

Note serialize is part of the HDFSMetadataLog Contract to serialize metadata (write metadata in serialized format).

serialize firstly writes the version header ( v and the metadataLogVersion) out to the

given output stream (in UTF_8 ).

serialize then writes the log data (serialized using Json4s (with Jackson binding) library).

Entries are separated by new lines.

Deserializing Metadata —  deserialize Method

deserialize(in: InputStream): Array[T]

Note deserialize is part of the HDFSMetadataLog Contract to…​FIXME.

deserialize …​FIXME

Storing Metadata Of Streaming Batch —  add Method

add(
batchId: Long,
logs: Array[T]): Boolean

Note add is part of the HDFSMetadataLog Contract to store metadata for a batch.

add …​FIXME

allFiles Method


allFiles(): Array[T]

allFiles …​FIXME

Note allFiles is used when:

FileStreamSource is created

MetadataLogFileIndex is created

compact Internal Method

compact(
batchId: Long,
logs: Array[T]): Boolean

compact getValidBatchesBeforeCompactionBatch (with the streaming batch and the

compact interval).

compact …​FIXME

In the end, compact compactLogs and requests the parent HDFSMetadataLog to persist
metadata of a streaming batch (to a metadata log file).

Note compact is used exclusively when CompactibleFileStreamLog is requested to persist metadata of a streaming batch.

getValidBatchesBeforeCompactionBatch Object
Method

getValidBatchesBeforeCompactionBatch(
compactionBatchId: Long,
compactInterval: Int): Seq[Long]

getValidBatchesBeforeCompactionBatch …​FIXME

Note getValidBatchesBeforeCompactionBatch is used exclusively when CompactibleFileStreamLog is requested to compact.

isCompactionBatch Object Method


isCompactionBatch(batchId: Long, compactInterval: Int): Boolean

isCompactionBatch …​FIXME

Note isCompactionBatch is used when:

CompactibleFileStreamLog is requested to batchIdToPath, store the metadata of a batch, deleteExpiredLog, and getValidBatchesBeforeCompactionBatch

FileStreamSourceLog is requested to store the metadata of a batch and get

getBatchIdFromFileName Object Method

getBatchIdFromFileName(fileName: String): Long

getBatchIdFromFileName simply removes the .compact suffix from the given fileName and

converts the remaining part to a number.
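For example (a sketch of what the conversion does, using a hypothetical helper batchIdOf that mirrors getBatchIdFromFileName):

// "8.compact" -> 8, "8" -> 8
def batchIdOf(fileName: String): Long =
  fileName.stripSuffix(".compact").toLong

assert(batchIdOf("8.compact") == 8L)
assert(batchIdOf("8") == 8L)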

Note getBatchIdFromFileName is used when CompactibleFileStreamLog is requested to pathToBatchId, isBatchFile, and deleteExpiredLog.

deleteExpiredLog Internal Method

deleteExpiredLog(
currentBatchId: Long): Unit

deleteExpiredLog does nothing and simply returns when the current batch ID incremented

( currentBatchId + 1 ) is below the compact interval plus the minBatchesToRetain.

deleteExpiredLog …​FIXME

Note deleteExpiredLog is used exclusively when CompactibleFileStreamLog is requested to store metadata of a streaming batch.

Internal Properties

Name Description
compactInterval Compact interval


FileStreamSourceLog
FileStreamSourceLog is a concrete CompactibleFileStreamLog (of FileEntry metadata) of

FileStreamSource.

FileStreamSourceLog uses a fixed-size cache of metadata of compaction batches.

FileStreamSourceLog uses spark.sql.streaming.fileSource.log.compactInterval configuration

property (default: 10 ) for the default compaction interval.

FileStreamSourceLog uses spark.sql.streaming.fileSource.log.cleanupDelay configuration

property (default: 10 minutes) for the fileCleanupDelayMs.

FileStreamSourceLog uses spark.sql.streaming.fileSource.log.deletion configuration property

(default: true ) for the isDeletingExpiredLog.

Creating FileStreamSourceLog Instance


FileStreamSourceLog (like the parent CompactibleFileStreamLog) takes the following to be

created:

Metadata version

SparkSession

Path of the metadata log directory

Storing (Adding) Metadata of Streaming Batch —  add


Method

add(
batchId: Long,
logs: Array[FileEntry]): Boolean

Note add is part of the MetadataLog Contract to store (add) metadata of a streaming batch.

add requests the parent CompactibleFileStreamLog to store metadata (possibly compacting

logs if the batch is compaction).

If so (and this is a compaction batch), add adds the batch and the logs to the fileEntryCache internal registry (possibly removing the eldest entry if the size is above the cacheSize).


get Method

get(
startId: Option[Long],
endId: Option[Long]): Array[(Long, Array[FileEntry])]

Note get is part of the MetadataLog Contract to…​FIXME.

get …​FIXME

Internal Properties

Name             Description

cacheSize        Size of the fileEntryCache that is exactly the compact interval

                 Used when the fileEntryCache is requested to add a new entry in add and get a compaction batch

fileEntryCache   Metadata of a streaming batch ( FileEntry ) per batch ID ( LinkedHashMap[Long, Array[FileEntry]] ) of size configured using the cacheSize

                 New entry added for a compaction batch when storing (adding) metadata of a streaming batch

                 Used when get (for a compaction batch)


OffsetSeqMetadata — Metadata of Streaming
Batch
OffsetSeqMetadata holds the metadata for the current streaming batch:

Event-time watermark threshold

Batch timestamp (in millis)

Streaming configuration with spark.sql.shuffle.partitions and


spark.sql.streaming.stateStore.providerClass Spark properties

Note OffsetSeqMetadata is used mainly when IncrementalExecution is created.

OffsetSeqMetadata considers some configuration properties as relevantSQLConfs:

SHUFFLE_PARTITIONS

STATE_STORE_PROVIDER_CLASS

STREAMING_MULTIPLE_WATERMARK_POLICY

FLATMAPGROUPSWITHSTATE_STATE_FORMAT_VERSION

STREAMING_AGGREGATION_STATE_FORMAT_VERSION

relevantSQLConfs are used when OffsetSeqMetadata is created and is requested to

setSessionConf.

Creating OffsetSeqMetadata —  apply Factory Method

apply(
batchWatermarkMs: Long,
batchTimestampMs: Long,
sessionConf: RuntimeConfig): OffsetSeqMetadata

apply …​FIXME

Note apply is used when…​FIXME

setSessionConf Method

setSessionConf(metadata: OffsetSeqMetadata, sessionConf: RuntimeConfig): Unit


setSessionConf …​FIXME

Note setSessionConf is used when…​FIXME


CheckpointFileManager Contract
CheckpointFileManager is the abstraction of checkpoint managers that manage checkpoint

files (metadata of streaming batches) on Hadoop DFS-compatible file systems.

CheckpointFileManager is created per spark.sql.streaming.checkpointFileManagerClass

configuration property if defined before reverting to the available checkpoint managers.

CheckpointFileManager is used exclusively by HDFSMetadataLog, StreamMetadata and

HDFSBackedStateStoreProvider.

Table 1. CheckpointFileManager Contract


Method         Description

createAtomic

  createAtomic(
    path: Path,
    overwriteIfPossible: Boolean): CancellableFSDataOutputStream

  Used when:

  HDFSMetadataLog is requested to store metadata for a batch (that writeBatchToFile)
  StreamMetadata helper object is requested to persist metadata
  HDFSBackedStateStore is requested for the deltaFileStream
  HDFSBackedStateStoreProvider is requested to writeSnapshotFile

delete

  delete(path: Path): Unit

  Deletes the given path recursively (if exists)

  Used when:

  RenameBasedFSDataOutputStream is requested to cancel
  CompactibleFileStreamLog is requested to store metadata for a batch (that deleteExpiredLog)
  HDFSMetadataLog is requested to remove expired metadata and purgeAfter
  HDFSBackedStateStoreProvider is requested to do maintenance (that cleans up)

exists

  exists(path: Path): Boolean

  Used when HDFSMetadataLog is created (to create the metadata directory) and requested for metadata of a batch

isLocal

  isLocal: Boolean

  Does not seem to be used.

list

  list(
    path: Path): Array[FileStatus] (1)
  list(
    path: Path,
    filter: PathFilter): Array[FileStatus]

  1. Uses PathFilter that accepts all files in the path

  Lists all files in the given path

  Used when:

  HDFSBackedStateStoreProvider is requested for all delta and snapshot files
  CompactibleFileStreamLog is requested for the compact interval and to deleteExpiredLog
  HDFSMetadataLog is requested for metadata of one or more batches, the latest committed batch, ordered batch metadata files, to remove expired metadata and purgeAfter

mkdirs

  mkdirs(path: Path): Unit

  Used when:

  HDFSMetadataLog is created
  HDFSBackedStateStoreProvider is requested to initialize

open

  open(path: Path): FSDataInputStream

  Opens a file (by the given path) for reading

  Used when:

  HDFSMetadataLog is requested for metadata of a batch
  HDFSBackedStateStoreProvider is requested to retrieve the state store for a specified version (that updateFromDeltaFile), and readSnapshotFile

Table 2. CheckpointFileManagers
CheckpointFileManager                   Description

FileContextBasedCheckpointFileManager   Default CheckpointFileManager that uses Hadoop’s FileContext API for managing checkpoint files (unless spark.sql.streaming.checkpointFileManagerClass configuration property is used)

FileSystemBasedCheckpointFileManager    Basic CheckpointFileManager that uses Hadoop’s FileSystem API for managing checkpoint files (that assumes that the implementation of FileSystem.rename() is atomic or the correctness and fault-tolerance of Structured Streaming is not guaranteed)

Creating CheckpointFileManager Instance —  create


Object Method

create(
path: Path,
hadoopConf: Configuration): CheckpointFileManager

create finds spark.sql.streaming.checkpointFileManagerClass configuration property in the

hadoopConf configuration.

If found, create simply instantiates whatever CheckpointFileManager implementation is


defined.

If not found, create creates a FileContextBasedCheckpointFileManager.

In case of UnsupportedFileSystemException , create prints out the following WARN


message to the logs and creates (falls back on) a FileSystemBasedCheckpointFileManager.

Could not use FileContext API for managing Structured Streaming checkpoint files at [path]. Using FileSystem API instead for managing log files. If the implementation of FileSystem.rename() is not atomic, then the correctness and fault-tolerance of your Structured Streaming is not guaranteed.


Note create is used when:

HDFSMetadataLog is created

StreamMetadata helper object is requested to write metadata to a file (when StreamExecution is created)

HDFSBackedStateStoreProvider is requested for the CheckpointFileManager
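A minimal sketch of creating and using a CheckpointFileManager directly (the checkpoint path is arbitrary):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.execution.streaming.CheckpointFileManager

val checkpointDir = new Path("/tmp/checkpoint-demo")
val fm = CheckpointFileManager.create(checkpointDir, new Configuration())

fm.mkdirs(checkpointDir)                 // create the directory if needed
val files = fm.list(checkpointDir)       // list all files in the directory
val out = fm.createAtomic(
  new Path(checkpointDir, "0"),
  overwriteIfPossible = false)           // write-to-temp-file-and-rename semantics
out.cancel()                             // or out.close() to commit the file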


FileContextBasedCheckpointFileManager
FileContextBasedCheckpointFileManager is…​FIXME


FileSystemBasedCheckpointFileManager — 
CheckpointFileManager on Hadoop’s
FileSystem API
FileSystemBasedCheckpointFileManager is a CheckpointFileManager that uses Hadoop’s

FileSystem API for managing checkpoint files:

list uses FileSystem.listStatus

mkdirs uses FileSystem.mkdirs

createTempFile uses FileSystem.create (with overwrite enabled)

createAtomic uses RenameBasedFSDataOutputStream

open uses FileSystem.open

exists uses FileSystem.getFileStatus

renameTempFile uses FileSystem.rename

delete uses FileSystem.delete (with recursive enabled)

isLocal is true for the FileSystem being LocalFileSystem or RawLocalFileSystem

FileSystemBasedCheckpointFileManager is created exclusively when CheckpointFileManager

helper object is requested for a CheckpointFileManager (for HDFSMetadataLog,


StreamMetadata and HDFSBackedStateStoreProvider).

FileSystemBasedCheckpointFileManager is a RenameHelperMethods for atomicity by "write-to-

temp-file-and-rename".

Creating FileSystemBasedCheckpointFileManager Instance


FileSystemBasedCheckpointFileManager takes the following to be created:

Checkpoint directory (Hadoop’s Path)

Configuration (Hadoop’s Configuration)

FileSystemBasedCheckpointFileManager initializes the internal properties.

Internal Properties


Name Description
fs Hadoop’s FileSystem of the checkpoint directory


Offset — Read Position of Streaming Query


Offset is the base of stream positions that represent progress of a streaming query in json

format.

Table 1. Offset Contract (Abstract Methods Only)


Method   Description

json

  String json()

  Converts the offset to JSON format (JSON-encoded offset)

  Used when:

  MicroBatchExecution stream execution engine is requested to construct the next streaming micro-batch and run a streaming micro-batch (with MicroBatchReader sources)
  OffsetSeq is requested for the textual representation
  OffsetSeqLog is requested to serialize metadata (write metadata in serialized format)
  ProgressReporter is requested to record trigger offsets
  ContinuousExecution stream execution engine is requested to run a streaming query in continuous mode and commit an epoch
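For example, the LongOffset used by the rate and socket data sources encodes just the offset number (a sketch run in spark-shell):

import org.apache.spark.sql.execution.streaming.LongOffset

val offset = LongOffset(42)
assert(offset.json == "42")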


Table 2. Offsets
Offset                         Description

ContinuousMemoryStreamOffset

FileStreamSourceOffset

KafkaSourceOffset

LongOffset

RateStreamOffset

SerializedOffset               JSON-encoded offset that is used when loading an offset from an external storage, e.g. from checkpoint after restart

TextSocketOffset


StreamProgress — Collection of Offsets per


Streaming Source
StreamProgress is a collection of Offsets per streaming source.

StreamProgress is created when:

StreamExecution is created (and creates committed and available offsets)

OffsetSeq is requested to convert to StreamProgress

StreamProgress is an extension of Scala’s scala.collection.immutable.Map with streaming

sources as keys and their Offsets as values.

Creating StreamProgress Instance


StreamProgress takes the following to be created:

Optional collection of offsets per streaming source ( Map[BaseStreamingSource, Offset] )


(default: empty)

Looking Up Offset by Streaming Source —  get Method

get(key: BaseStreamingSource): Option[Offset]

Note get is part of Scala's scala.collection.MapLike to…​FIXME.

get simply looks up an Offset for the given BaseStreamingSource in the baseMap.

++ Method

++(
updates: GenTraversableOnce[(BaseStreamingSource, Offset)]): StreamProgress

++ simply creates a new StreamProgress with the baseMap and the given updates.

Note ++ is used exclusively when OffsetSeq is requested to convert to StreamProgress.

Converting to OffsetSeq —  toOffsetSeq Method


toOffsetSeq(
sources: Seq[BaseStreamingSource],
metadata: OffsetSeqMetadata): OffsetSeq

toOffsetSeq creates an OffsetSeq with offsets that are looked up for every BaseStreamingSource.

Note toOffsetSeq is used when:

MicroBatchExecution stream execution engine is requested to construct the next streaming micro-batch (to commit available offsets for a batch to the write-ahead log)

StreamExecution is requested to run stream processing (that failed with a Throwable)


Micro-Batch Stream Processing (Structured


Streaming V1)
Micro-Batch Stream Processing is a stream processing model in Spark Structured
Streaming that is used for streaming queries with Trigger.Once and Trigger.ProcessingTime
triggers.

Micro-batch stream processing uses MicroBatchExecution stream execution engine.

Micro-batch stream processing supports MicroBatchReadSupport data sources.

Micro-batch stream processing is often referred to as Structured Streaming V1.

import org.apache.spark.sql.streaming.Trigger
import scala.concurrent.duration._
val sq = spark
.readStream
.format("rate")
.load
.writeStream
.format("console")
.option("truncate", false)
  .trigger(Trigger.ProcessingTime(1.minute)) // <-- Uses MicroBatchExecution for execution
.queryName("rate2console")
.start

assert(sq.isActive)

scala> sq.explain
== Physical Plan ==
WriteToDataSourceV2 org.apache.spark.sql.execution.streaming.sources.MicroBatchWriter@
678e6267
+- *(1) Project [timestamp#54, value#55L]
+- *(1) ScanV2 rate[timestamp#54, value#55L]

// sq.stop

Execution Phases (Processing Cycle)


Once MicroBatchExecution stream processing engine is requested to run an activated
streaming query, the query execution goes through the following execution phases every
trigger:

1. triggerExecution


2. getOffset for Sources or setOffsetRange for MicroBatchReaders

3. getEndOffset

4. walCommit

5. getBatch

6. queryPlanning

7. addBatch

Execution phases with execution times are available using StreamingQueryProgress under
durationMs .

scala> :type sq
org.apache.spark.sql.streaming.StreamingQuery
sq.lastProgress.durationMs.get("walCommit")

Tip Enable INFO logging level for StreamExecution logger to be notified about durations.

17/08/11 09:04:17 INFO StreamExecution: Streaming query made progress: {


"id" : "ec8f8228-90f6-4e1f-8ad2-80222affed63",
"runId" : "f605c134-cfb0-4378-88c1-159d8a7c232e",
"name" : "rates-to-console",
"timestamp" : "2017-08-11T07:04:17.373Z",
"batchId" : 0,
"numInputRows" : 0,
"processedRowsPerSecond" : 0.0,
"durationMs" : { // <-- Durations (in millis)
"addBatch" : 38,
"getBatch" : 1,
"getOffset" : 0,
"queryPlanning" : 1,
"triggerExecution" : 62,
"walCommit" : 19
},

Monitoring (using StreamingQueryListener and Logs)


MicroBatchExecution posts events to announce when a streaming query is started and stopped as well as after every micro-batch. StreamingQueryListener interface can be used to intercept the events and act accordingly.

After triggerExecution phase MicroBatchExecution is requested to finish up a streaming batch (trigger) and generate a StreamingQueryProgress (with execution statistics).


MicroBatchExecution prints out the following DEBUG message to the logs:

Execution stats: [executionStats]

MicroBatchExecution posts a QueryProgressEvent with the StreamingQueryProgress and prints out the following INFO message to the logs:

Streaming query made progress: [newProgress]
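
A minimal listener sketch (assuming a SparkSession named spark); it just prints the events, which is enough to observe the per-micro-batch QueryProgressEvents described above:

import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

val listener = new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit =
    println(s"Query started: id=${event.id} runId=${event.runId}")
  override def onQueryProgress(event: QueryProgressEvent): Unit =
    println(s"Progress: batchId=${event.progress.batchId} durations=${event.progress.durationMs}")
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit =
    println(s"Query terminated: id=${event.id} exception=${event.exception}")
}

spark.streams.addListener(listener)
// spark.streams.removeListener(listener) to unregister it later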


MicroBatchExecution — Stream Execution Engine of Micro-Batch Stream Processing
MicroBatchExecution is the stream execution engine in Micro-Batch Stream Processing.

MicroBatchExecution is created when StreamingQueryManager is requested to create a streaming query (when DataStreamWriter is requested to start an execution of the streaming query) with the following:

Any type of sink but StreamWriteSupport

Any type of trigger but ContinuousTrigger

import org.apache.spark.sql.streaming.Trigger
val query = spark
.readStream
.format("rate")
.load
.writeStream
.format("console") // <-- not a StreamWriteSupport sink
.option("truncate", false)
.trigger(Trigger.Once) // <-- Gives MicroBatchExecution
.queryName("rate2console")
.start

// The following gives access to the internals
// And to MicroBatchExecution
import org.apache.spark.sql.execution.streaming.StreamingQueryWrapper
val engine = query.asInstanceOf[StreamingQueryWrapper].streamingQuery
import org.apache.spark.sql.execution.streaming.StreamExecution
assert(engine.isInstanceOf[StreamExecution])

import org.apache.spark.sql.execution.streaming.MicroBatchExecution
val microBatchEngine = engine.asInstanceOf[MicroBatchExecution]
assert(microBatchEngine.trigger == Trigger.Once)

Once created, MicroBatchExecution (as a stream execution engine) is requested to run an activated streaming query.


Tip: Enable ALL logging level for org.apache.spark.sql.execution.streaming.MicroBatchExecution to see what happens inside.

Add the following line to conf/log4j.properties :

log4j.logger.org.apache.spark.sql.execution.streaming.MicroBatchExecution=ALL

Refer to Logging.

Creating MicroBatchExecution Instance


MicroBatchExecution takes the following to be created:

SparkSession

Name of the streaming query

Path of the checkpoint directory

Analyzed logical query plan of the streaming query ( LogicalPlan )

Streaming sink

Trigger

Trigger clock ( Clock )

Output mode

Extra options ( Map[String, String] )

deleteCheckpointOnStop flag to control whether to delete the checkpoint directory on stop

MicroBatchExecution initializes the internal properties.

MicroBatchExecution and TriggerExecutor —  triggerExecutor Property

triggerExecutor: TriggerExecutor

triggerExecutor is the TriggerExecutor of the streaming query that controls how micro-batches are executed at regular intervals.


triggerExecutor is initialized based on the given Trigger (that was used to create the MicroBatchExecution ):

ProcessingTimeExecutor for Trigger.ProcessingTime

OneTimeExecutor for OneTimeTrigger (aka Trigger.Once trigger)

triggerExecutor throws an IllegalStateException when the Trigger is not one of the built-in implementations.

Unknown type of trigger: [trigger]

Note: triggerExecutor is used exclusively when StreamExecution is requested to run an activated streaming query (at regular intervals).

Running Activated Streaming Query —  runActivatedStream Method

runActivatedStream(
sparkSessionForStream: SparkSession): Unit

Note: runActivatedStream is part of the StreamExecution Contract to run the activated streaming query.

runActivatedStream simply requests the TriggerExecutor to execute micro-batches using the batch runner (until MicroBatchExecution is terminated due to a query stop or a failure).

TriggerExecutor’s Batch Runner


The batch runner (of the TriggerExecutor) is executed as long as the MicroBatchExecution
is active.

Note trigger and batch are considered equivalent and used interchangeably.

The batch runner initializes query progress for the new trigger (aka startTrigger).

The batch runner starts triggerExecution execution phase that is made up of the following
steps:

1. Populating start offsets from checkpoint before the first "zero" batch (at every start or
restart)

2. Constructing or skipping the next streaming micro-batch


3. Running the streaming micro-batch

At the start or restart (resume) of a streaming query (when the current batch ID is
uninitialized and -1 ), the batch runner populates start offsets from checkpoint and then
prints out the following INFO message to the logs (using the committedOffsets internal
registry):

Stream started from [committedOffsets]

The batch runner sets the human-readable description for any Spark job submitted (that
streaming sources may submit to get new data) as the batch description.

The batch runner constructs the next streaming micro-batch (when the
isCurrentBatchConstructed internal flag is off).

The batch runner records trigger offsets (with the committed and available offsets).

The batch runner updates the current StreamingQueryStatus with the isNewDataAvailable
for isDataAvailable property.

With the isCurrentBatchConstructed flag enabled ( true ), the batch runner updates the
status message to one of the following (per isNewDataAvailable) and runs the streaming
micro-batch.

Processing new data

No new data but cleaning up state

With the isCurrentBatchConstructed flag disabled ( false ), the batch runner simply updates
the status message to the following:

Waiting for data to arrive

The batch runner finalizes query progress for the trigger (with a flag that indicates whether
the current batch had new data).

With the isCurrentBatchConstructed flag enabled ( true ), the batch runner increments the
currentBatchId and turns the isCurrentBatchConstructed flag off ( false ).

With the isCurrentBatchConstructed flag disabled ( false ), the batch runner simply sleeps
(as long as configured using the spark.sql.streaming.pollingDelay configuration property).


In the end, the batch runner updates the status message to the following status and returns
whether the MicroBatchExecution is active or not.

Waiting for next trigger
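
The overall control flow can be summarized with the toy sketch below. It is a plain-Scala illustration, not the actual implementation; the real loop is driven by the TriggerExecutor and uses offset logs and real sources, and the names below are simplifications of the phases described above:

object BatchRunnerSketch extends App {
  @volatile var active = true
  var currentBatchId = 0L
  var isCurrentBatchConstructed = false
  val pollingDelayMs = 100L // stands in for spark.sql.streaming.pollingDelay

  // Pretend there is new data for the first three triggers only
  def constructNextBatch(): Boolean = currentBatchId < 3
  def runBatch(): Unit = println(s"Processing new data for batch $currentBatchId")

  while (active) {
    if (!isCurrentBatchConstructed) {
      isCurrentBatchConstructed = constructNextBatch()
    }
    if (isCurrentBatchConstructed) {
      runBatch()                        // run the streaming micro-batch
      currentBatchId += 1               // increment the batch ID
      isCurrentBatchConstructed = false // and reset the flag
    } else {
      Thread.sleep(pollingDelayMs)      // "Waiting for data to arrive"
      active = false                    // a real query keeps polling until stopped
    }
  }
  println("Waiting for next trigger")
}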

Populating Start Offsets From Checkpoint (Resuming from Checkpoint) —  populateStartOffsets Internal Method

populateStartOffsets(
sparkSessionToRunBatches: SparkSession): Unit

populateStartOffsets requests the Offset Write-Ahead Log for the latest committed batch id with metadata (i.e. OffsetSeq).

Note: The batch id may not be available in the write-ahead log when a streaming query started with a new log or no batch was persisted (added) to the log before.

populateStartOffsets branches off based on whether the latest committed batch was available or not.

Note: populateStartOffsets is used exclusively when MicroBatchExecution is requested to run an activated streaming query (before the first "zero" micro-batch).

Latest Committed Batch Available


When the latest committed batch id with the metadata was available in the Offset Write-
Ahead Log, populateStartOffsets (re)initializes the internal state as follows:

Sets the current batch ID to the latest committed batch ID found

Turns the isCurrentBatchConstructed internal flag on ( true )

Sets the available offsets to the offsets (from the metadata)

When the latest batch ID found is greater than 0 , populateStartOffsets requests the
Offset Write-Ahead Log for the second latest batch ID with metadata or throws an
IllegalStateException if not found.

batch [latestBatchId - 1] doesn't exist

populateStartOffsets sets the committed offsets to the second latest committed offsets.


populateStartOffsets updates the offset metadata.

Caution FIXME Describe me

populateStartOffsets requests the Offset Commit Log for the latest committed batch id with

metadata (i.e. CommitMetadata).

Caution FIXME Describe me

When the latest committed batch id with metadata was found which is exactly the latest
batch ID (found in the Offset Commit Log), populateStartOffsets …​FIXME

When the latest committed batch id with metadata was found, but it is not exactly the second
latest batch ID (found in the Offset Commit Log), populateStartOffsets prints out the
following WARN message to the logs:

Batch completion log latest batch id is [latestCommittedBatchId], which is not trailing batchid [latestBatchId] by one

When no commit log present in the Offset Commit Log, populateStartOffsets prints out the
following INFO message to the logs:

no commit log present

In the end, populateStartOffsets prints out the following DEBUG message to the logs:

Resuming at batch [currentBatchId] with committed offsets [committedOffsets] and available offsets [availableOffsets]

No Latest Committed Batch


When the latest committed batch id with the metadata could not be found in the Offset Write-
Ahead Log, it is assumed that the streaming query is started for the very first time (or the
checkpoint location has changed).

populateStartOffsets prints out the following INFO message to the logs:

Starting new streaming query.


populateStartOffsets sets the current batch ID to 0 and creates a new WatermarkTracker.

Constructing Or Skipping Next Streaming Micro-Batch —  constructNextBatch Internal Method

constructNextBatch(
noDataBatchesEnabled: Boolean): Boolean

Note: constructNextBatch is only executed when the isCurrentBatchConstructed internal flag is off ( false ).

constructNextBatch performs the following steps:

1. Requesting the latest offsets from every streaming source (of the streaming query)

2. Updating availableOffsets StreamProgress with the latest available offsets

3. Updating batch metadata with the current event-time watermark and batch timestamp

4. Checking whether to construct the next micro-batch or not (skip it)

In the end, constructNextBatch returns whether the next streaming micro-batch was
constructed or skipped.

Note: constructNextBatch is used exclusively when MicroBatchExecution is requested to run the activated streaming query.

Requesting Latest Offsets from Streaming Sources (getOffset, setOffsetRange and getEndOffset Phases)

constructNextBatch firstly requests every streaming source for the latest offsets.

Note: constructNextBatch checks out the latest offset in every streaming data source sequentially, i.e. one data source at a time.


Figure 1. MicroBatchExecution’s Getting Offsets From Streaming Sources


For every streaming source (Data Source API V1), constructNextBatch updates the status
message to the following:

Getting offsets from [source]

In getOffset time-tracking section, constructNextBatch requests the Source for the latest
offset.

For every MicroBatchReader (Data Source API V2), constructNextBatch updates the status
message to the following:

Getting offsets from [source]

In setOffsetRange time-tracking section, constructNextBatch finds the available offsets of the source (in the available offset internal registry) and, if found, requests the MicroBatchReader to deserialize the offset (from JSON format). constructNextBatch requests the MicroBatchReader to set the desired offset range.

In getEndOffset time-tracking section, constructNextBatch requests the MicroBatchReader for the end offset.

Updating availableOffsets StreamProgress with Latest Available Offsets

constructNextBatch updates the availableOffsets StreamProgress with the latest reported offsets.


Updating Batch Metadata with Current Event-Time Watermark and Batch Timestamp

constructNextBatch updates the batch metadata with the current event-time watermark (from the WatermarkTracker) and the batch timestamp.

Checking Whether to Construct Next Micro-Batch or Not (Skip It)

constructNextBatch checks whether or not the next streaming micro-batch should be constructed ( lastExecutionRequiresAnotherBatch ).

constructNextBatch uses the last IncrementalExecution if the last execution requires another micro-batch (using the batch metadata) and the given noDataBatchesEnabled flag is enabled ( true ).

constructNextBatch also checks out whether new data is available (based on available and committed offsets).

Note: The shouldConstructNextBatch local flag is enabled ( true ) when there is new data available (based on offsets) or the last execution requires another micro-batch (and the given noDataBatchesEnabled flag is enabled).

constructNextBatch prints out the following TRACE message to the logs:

noDataBatchesEnabled = [noDataBatchesEnabled], lastExecutionRequiresAnotherBatch = [lastExecutionRequiresAnotherBatch], isNewDataAvailable = [isNewDataAvailable], shouldConstructNextBatch = [shouldConstructNextBatch]

constructNextBatch branches off per whether to construct or skip the next batch (per the shouldConstructNextBatch flag in the above TRACE message).

Constructing Next Micro-Batch —  shouldConstructNextBatch Flag Enabled

With the shouldConstructNextBatch flag enabled ( true ), constructNextBatch updates the status message to the following:

Writing offsets to log


In walCommit time-tracking section, constructNextBatch requests the availableOffsets StreamProgress to convert to OffsetSeq (with the BaseStreamingSources and the current batch metadata (event-time watermark and timestamp)) that is in turn added to the write-ahead log for the current batch ID.

constructNextBatch prints out the following INFO message to the logs:

Committed offsets for batch [currentBatchId]. Metadata [offsetSeqMetadata]

Note FIXME ( if (currentBatchId != 0) …​ )

Note FIXME ( if (minLogEntriesToMaintain < currentBatchId) …​ )

constructNextBatch turns the noNewData internal flag off ( false ).

In case of a failure while adding the available offsets to the write-ahead log,
constructNextBatch throws an AssertionError :

Concurrent update to the log. Multiple streaming jobs detected for [currentBatchId]

Skipping Next Micro-Batch —  shouldConstructNextBatch Flag Disabled

With the shouldConstructNextBatch flag disabled ( false ), constructNextBatch turns the
noNewData flag on ( true ) and wakes up (notifies) all threads waiting for the
awaitProgressLockCondition lock.

Running Single Streaming Micro-Batch —  runBatch Internal Method

runBatch(
sparkSessionToRunBatch: SparkSession): Unit

runBatch prints out the following DEBUG message to the logs (with the current batch ID):

Running batch [currentBatchId]

runBatch then performs the following steps (aka phases):

1. getBatch Phase — Creating Logical Query Plans For Unprocessed Data From Sources
and MicroBatchReaders


2. Transforming Logical Plan to Include Sources and MicroBatchReaders with New Data

3. Transforming CurrentTimestamp and CurrentDate Expressions (Per Batch Metadata)

4. Adapting Transformed Logical Plan to Sink with StreamWriteSupport

5. Setting Local Properties

6. queryPlanning Phase — Creating and Preparing IncrementalExecution for Execution

7. nextBatch Phase — Creating DataFrame (with IncrementalExecution for New Data)

8. addBatch Phase — Adding DataFrame With New Data to Sink

9. Updating Watermark and Committing Offsets to Offset Commit Log

In the end, runBatch prints out the following DEBUG message to the logs (with the current
batch ID):

Completed batch [currentBatchId]

Note: runBatch is used exclusively when MicroBatchExecution is requested to run an activated streaming query (and there is new data to process).

getBatch Phase — Creating Logical Query Plans For Unprocessed Data From Sources and MicroBatchReaders

In getBatch time-tracking section, runBatch goes over the available offsets and processes every Source and MicroBatchReader (associated with the available offsets) to create logical query plans ( newData ) for data processing (per offset ranges).

Note: runBatch requests sources and readers for data per offset range sequentially, one by one.

Figure 2. StreamExecution’s Running Single Streaming Batch (getBatch Phase)


getBatch Phase and Sources


For a Source (with the available offsets different from the committedOffsets registry),
runBatch does the following:

Requests the committedOffsets for the committed offsets for the Source (if available)

Requests the Source for a dataframe for the offset range (the current and available
offsets)

runBatch prints out the following DEBUG message to the logs.

Retrieving data from [source]: [current] -> [available]

In the end, runBatch returns the Source and the logical plan of the streaming dataset (for
the offset range).

In case the Source returns a dataframe that is not streaming, runBatch throws an
AssertionError :

DataFrame returned by getBatch from [source] did not have isStreaming=true\n[logicalQueryPlan]

getBatch Phase and MicroBatchReaders


For a MicroBatchReader (with the available offsets different from the committedOffsets
registry), runBatch does the following:

Requests the committedOffsets for the committed offsets for the MicroBatchReader (if
available)

Requests the MicroBatchReader to deserialize the committed offsets (if available)

Requests the MicroBatchReader to deserialize the available offsets (only for SerializedOffsets)

Requests the MicroBatchReader to set the offset range (the current and available
offsets)

runBatch prints out the following DEBUG message to the logs.

Retrieving data from [reader]: [current] -> [availableV2]


runBatch looks up the DataSourceV2 and the options for the MicroBatchReader (in the readerToDataSourceMap internal registry).

In the end, runBatch requests the MicroBatchReader for the read schema and creates a StreamingDataSourceV2Relation logical operator (with the read schema, the DataSourceV2 , options, and the MicroBatchReader ).

Transforming Logical Plan to Include Sources and MicroBatchReaders with New Data

Figure 3. StreamExecution’s Running Single Streaming Batch (and Transforming Logical Plan for New Data)

runBatch transforms the analyzed logical plan to include Sources and MicroBatchReaders with new data ( newBatchesPlan with logical plans to process data that has arrived since the last batch).


For every StreamingExecutionRelation (with a Source or MicroBatchReader), runBatch tries to find the corresponding logical plan for processing new data.

Note: StreamingExecutionRelation logical operator is used to represent a streaming source or reader in the logical query plan (of a streaming query).

If the logical plan is found, runBatch makes the plan a child operator of Project (with
Aliases ) logical operator and replaces the StreamingExecutionRelation .

Otherwise, if not found, runBatch simply creates an empty streaming LocalRelation (for
scanning data from an empty local collection).

In case the number of columns in dataframes with new data and StreamingExecutionRelation 's do not match, runBatch throws an AssertionError :

Invalid batch: [output] != [dataPlan.output]

Transforming CurrentTimestamp and CurrentDate Expressions (Per Batch Metadata)

runBatch replaces all CurrentTimestamp and CurrentDate expressions in the transformed logical plan (with new data) with the current batch timestamp (based on the batch metadata).

Note: CurrentTimestamp and CurrentDate expressions correspond to the current_timestamp and current_date standard functions, respectively. Read up on The Internals of Spark SQL to learn more about the standard functions.

Adapting Transformed Logical Plan to Sink with StreamWriteSupport

runBatch adapts the transformed logical plan (with new data and current batch timestamp) for the new StreamWriteSupport sinks (per the type of the BaseStreamingSink).

For a StreamWriteSupport (Data Source API V2), runBatch requests the StreamWriteSupport for a StreamWriter (for the runId, the output schema, the OutputMode, and the extra options). runBatch then creates a WriteToDataSourceV2 logical operator with a new MicroBatchWriter as a child operator (for the current batch ID and the StreamWriter).

For a Sink (Data Source API V1), runBatch changes nothing.

For any other BaseStreamingSink type, runBatch simply throws an IllegalArgumentException :

unknown sink type for [sink]

Setting Local Properties


runBatch sets the local properties.

Table 1. runBatch’s Local Properties

Local Property | Value
streaming.sql.batchId | currentBatchId
__is_continuous_processing | false

queryPlanning Phase — Creating and Preparing IncrementalExecution for Execution

Figure 4. StreamExecution’s Query Planning (queryPlanning Phase)


In queryPlanning time-tracking section, runBatch creates a new IncrementalExecution
with the following:

Transformed logical plan

Output mode

state checkpoint directory

Run id

Batch id

Batch Metadata (Event-Time Watermark and Timestamp)

In the end (of the queryPlanning phase), runBatch requests the IncrementalExecution to
prepare the transformed logical plan for execution (i.e. execute the executedPlan query
execution phase).

Tip: Read up on the executedPlan query execution phase in The Internals of Spark SQL.


nextBatch Phase — Creating DataFrame (with IncrementalExecution for New Data)

Figure 5. StreamExecution Creates DataFrame with New Data

runBatch creates a new DataFrame with the new IncrementalExecution.

The DataFrame represents the result of executing the current micro-batch of the streaming
query.

addBatch Phase — Adding DataFrame With New Data to Sink

Figure 6. StreamExecution Adds DataFrame With New Data to Sink


In addBatch time-tracking section, runBatch adds the DataFrame with new data to the
BaseStreamingSink.

For a Sink (Data Source API V1), runBatch simply requests the Sink to add the
DataFrame (with the batch ID).

For a StreamWriteSupport (Data Source API V2), runBatch simply requests the DataFrame
with new data to collect (which simply forces execution of the MicroBatchWriter).

Note: runBatch uses SQLExecution.withNewExecutionId to execute and track all the Spark jobs under one execution id (so it is reported as one single multi-job execution, e.g. in web UI).


Note: SQLExecution.withNewExecutionId posts a SparkListenerSQLExecutionStart event before execution and a SparkListenerSQLExecutionEnd event right afterwards.

Tip: Register a SparkListener to get notified about the SQL execution events ( SparkListenerSQLExecutionStart and SparkListenerSQLExecutionEnd ). Read up on SparkListener in The Internals of Apache Spark.

Updating Watermark and Committing Offsets to Offset Commit Log

runBatch requests the WatermarkTracker to update the event-time watermark (with the executedPlan of the IncrementalExecution).

runBatch requests the Offset Commit Log to persist the metadata of the streaming micro-batch (with the current batch ID and the event-time watermark of the WatermarkTracker).

In the end, runBatch adds the available offsets to the committed offsets (and updates the
offsets of every BaseStreamingSource with new data in the current micro-batch).

Stopping Stream Processing (Execution of Streaming Query) —  stop Method

stop(): Unit

Note stop is part of the StreamingQuery Contract to stop a streaming query.

stop sets the state to TERMINATED.

When the stream execution thread is alive, stop requests the current SparkContext to
cancelJobGroup identified by the runId and waits for this thread to die. Just to make sure

that there are no more streaming jobs, stop requests the current SparkContext to
cancelJobGroup identified by the runId again.

In the end, stop prints out the following INFO message to the logs:

Query [prettyIdString] was stopped


Checking Whether New Data Is Available (Based on Available and Committed Offsets) —  isNewDataAvailable Internal Method

isNewDataAvailable: Boolean

isNewDataAvailable checks whether there is a streaming source (in the available offsets) for which committed offsets are different from the available offsets or not available (committed) at all.

isNewDataAvailable is positive ( true ) when there is at least one such streaming source.

Note: isNewDataAvailable is used when MicroBatchExecution is requested to run an activated streaming query and construct the next streaming micro-batch.
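
The check can be illustrated with this plain-Scala sketch (Strings stand in for streaming sources and Longs for offsets; it is not the actual implementation):

// A source has new data if its available offset differs from the committed one,
// or if nothing has been committed for it yet
def isNewDataAvailable(
    availableOffsets: Map[String, Long],
    committedOffsets: Map[String, Long]): Boolean =
  availableOffsets.exists { case (source, available) =>
    committedOffsets.get(source) match {
      case Some(committed) => committed != available
      case None            => true
    }
  }

assert(isNewDataAvailable(Map("rate" -> 5L), Map("rate" -> 3L)))
assert(!isNewDataAvailable(Map("rate" -> 5L), Map("rate" -> 5L)))
assert(isNewDataAvailable(Map("rate" -> 5L), Map.empty))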

Analyzed Logical Plan With Unique StreamingExecutionRelation Operators —  logicalPlan Lazy Property

logicalPlan: LogicalPlan

Note: logicalPlan is part of the StreamExecution Contract to be the analyzed logical plan of the streaming query.

logicalPlan resolves (replaces) StreamingRelation and StreamingRelationV2 logical operators to StreamingExecutionRelation logical operators. logicalPlan uses the transformed logical plan to set the uniqueSources and sources internal registries to the BaseStreamingSources of all the StreamingExecutionRelations (the unique ones and all of them, respectively).

Note: logicalPlan is a Scala lazy value and so the initialization is guaranteed to happen only once at the first access (and is cached for later use afterwards).

Internally, logicalPlan transforms the analyzed logical plan.

For every StreamingRelation logical operator, logicalPlan tries to replace it with the
StreamingExecutionRelation that was used earlier for the same StreamingRelation (if used
multiple times in the plan) or creates a new one. While creating a new
StreamingExecutionRelation , logicalPlan requests the DataSource to create a streaming

Source with the metadata path as sources/uniqueID directory in the checkpoint root
directory. logicalPlan prints out the following INFO message to the logs:


Using Source [source] from DataSourceV1 named '[sourceName]' [dataSourceV1]

For every StreamingRelationV2 logical operator with a MicroBatchReadSupport data source (which is not on the list of spark.sql.streaming.disabledV2MicroBatchReaders), logicalPlan tries to replace it with the StreamingExecutionRelation that was used earlier for the same StreamingRelationV2 (if used multiple times in the plan) or creates a new one. While creating a new StreamingExecutionRelation , logicalPlan requests the MicroBatchReadSupport to create a MicroBatchReader with the metadata path as sources/uniqueID directory in the checkpoint root directory. logicalPlan prints out the following INFO message to the logs:

Using MicroBatchReader [reader] from DataSourceV2 named '[sourceName]' [dataSourceV2]

For every other StreamingRelationV2 logical operator, logicalPlan tries to replace it with
the StreamingExecutionRelation that was used earlier for the same StreamingRelationV2 (if
used multiple times in the plan) or creates a new one. While creating a new
StreamingExecutionRelation , logicalPlan requests the StreamingRelation for the

underlying DataSource that is in turn requested to create a streaming Source with the
metadata path as sources/uniqueID directory in the checkpoint root directory. logicalPlan
prints out the following INFO message to the logs:

Using Source [source] from DataSourceV2 named '[sourceName]' [dataSourceV2]

logicalPlan requests the transformed analyzed logical plan for all StreamingExecutionRelations that are then requested for BaseStreamingSources, and saves them as the sources internal registry.

In the end, logicalPlan sets the uniqueSources internal registry to be the unique
BaseStreamingSources above.

logicalPlan throws an AssertionError when not executed on the stream execution thread.

logicalPlan must be initialized in QueryExecutionThread but the current thread was [currentThread]

streaming.sql.batchId Local Property


MicroBatchExecution defines streaming.sql.batchId as the name of the local property that holds the current batch (or epoch) ID that Spark tasks can read (see the sketch after this list).

streaming.sql.batchId is used when:

MicroBatchExecution is requested to run a single streaming micro-batch (and sets the property to be the current batch ID)

DataWritingSparkTask is requested to run (and needs an epoch ID)
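
A sketch of reading the property from inside a task (the property is only set while a micro-batch or epoch is running, so the value may be null outside that window):

import org.apache.spark.TaskContext

// Inside a task (e.g. in a mapPartitions function), read the current batch/epoch ID
val batchId: Option[String] =
  Option(TaskContext.get()).flatMap(tc => Option(tc.getLocalProperty("streaming.sql.batchId")))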

Internal Properties

isCurrentBatchConstructed
Flag to control whether to run a streaming micro-batch ( true ) or not ( false )
Default: false
When disabled ( false ), changed to whatever constructing the next streaming micro-batch gives back when running an activated streaming query
Disabled ( false ) after running a streaming micro-batch (when enabled after constructing the next streaming micro-batch)
Enabled ( true ) when populating start offsets (when running an activated streaming query) and re-starting a streaming query from a checkpoint (using the Offset Write-Ahead Log)
Disabled ( false ) when populating start offsets (when running an activated streaming query) and re-starting a streaming query from a checkpoint when the latest offset checkpointed (written) to the offset write-ahead log has been successfully processed and committed to the Offset Commit Log

readerToDataSourceMap ( Map[MicroBatchReader, (DataSourceV2, Map[String, String])] )
Streaming sources and readers (of the StreamingExecutionRelations of the analyzed logical query plan of the streaming query)
Default: (empty)

sources
Note: sources is part of the ProgressReporter Contract for the streaming sources of the streaming query.
Initialized when MicroBatchExecution is requested for the transformed logical query plan
Used when:
- Populating start offsets (for the available and committed offsets)
- Constructing or skipping next streaming micro-batch (and persisting offsets to write-ahead log)

watermarkTracker
WatermarkTracker that is created when MicroBatchExecution is requested to populate start offsets (when requested to run an activated streaming query)


MicroBatchWriter — Data Source Writer in Micro-Batch Stream Processing (Data Source API V2)
MicroBatchWriter is a DataSourceWriter (Spark SQL) that uses the given batch ID as the epoch when requested to commit, abort and create a WriterFactory for a given StreamWriter in Micro-Batch Stream Processing.

Tip: Read up on DataSourceWriter in The Internals of Spark SQL book.

MicroBatchWriter is part of the novel Data Source API V2 in Spark SQL.

MicroBatchWriter is created exclusively when MicroBatchExecution is requested to run a streaming batch (with a StreamWriteSupport streaming sink).


MicroBatchReadSupport Contract — Data Sources with MicroBatchReaders

MicroBatchReadSupport is the extension of the DataSourceV2 for data sources with a MicroBatchReader for Micro-Batch Stream Processing.

MicroBatchReadSupport defines a single createMicroBatchReader method to create a MicroBatchReader.

MicroBatchReader createMicroBatchReader(
Optional<StructType> schema,
String checkpointLocation,
DataSourceOptions options)

createMicroBatchReader is used when:

MicroBatchExecution is requested for the analyzed logical plan (and creates a StreamingExecutionRelation for a StreamingRelationV2 with a MicroBatchReadSupport data source)

DataStreamReader is requested to create a streaming query for a MicroBatchReadSupport data source

Table 1. MicroBatchReadSupports

MicroBatchReadSupport | Description
KafkaSourceProvider | Data source provider for kafka format
RateStreamProvider | Data source provider for rate format
TextSocketSourceProvider | Data source provider for socket format


MicroBatchReader Contract — Data Source Readers in Micro-Batch Stream Processing (Data Source API V2)

MicroBatchReader is the extension of Spark SQL’s DataSourceReader (and BaseStreamingSource) contracts for data source readers in Micro-Batch Stream Processing.

MicroBatchReader is part of the novel Data Source API V2 in Spark SQL.

Tip Read up on Data Source API V2 in The Internals of Spark SQL book.


Table 1. MicroBatchReader Contract

commit
void commit(Offset end)
Used when…​FIXME

deserializeOffset
Offset deserializeOffset(String json)
Deserializes offset (from JSON format)
Used when…​FIXME

getEndOffset
Offset getEndOffset()
End offset of this reader
Used when…​FIXME

getStartOffset
Offset getStartOffset()
Start (beginning) offsets of this reader
Used when…​FIXME

setOffsetRange
void setOffsetRange(
  Optional<Offset> start,
  Optional<Offset> end)
Sets the desired offset range for input partitions created from this reader (for data scan)
Used when…​FIXME

Table 2. MicroBatchReaders
MicroBatchReader Description

KafkaMicroBatchReader

MemoryStream

RateStreamMicroBatchReader

TextSocketMicroBatchReader


WatermarkTracker
WatermarkTracker tracks the event-time watermark of a streaming query (across EventTimeWatermarkExec operators in a physical query plan) based on a given MultipleWatermarkPolicy.

WatermarkTracker is used exclusively in MicroBatchExecution.

WatermarkTracker is created (using the factory method) when MicroBatchExecution is

requested to populate start offsets (when requested to run an activated streaming query).

WatermarkTracker takes a single MultipleWatermarkPolicy to be created.

MultipleWatermarkPolicy can be one of the following:

MaxWatermark (alias: max )

MinWatermark (alias: min )

Tip: Enable ALL logging level for org.apache.spark.sql.execution.streaming.WatermarkTracker to see what happens inside.

Add the following line to conf/log4j.properties :

log4j.logger.org.apache.spark.sql.execution.streaming.WatermarkTracker=ALL

Refer to Logging.

Creating WatermarkTracker —  apply Factory Method

apply(conf: RuntimeConfig): WatermarkTracker

apply uses the spark.sql.streaming.multipleWatermarkPolicy configuration property for the global watermark policy (default: min ) and creates a WatermarkTracker.

Note: apply is used exclusively when MicroBatchExecution is requested to populate start offsets (when requested to run an activated streaming query).
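
For example, to pick the max policy instead of the default min (a sketch; set it before the streaming query is started, and it is not meant to be changed across restarts from the same checkpoint location):

// Use the maximum event-time watermark across multiple watermark operators
spark.conf.set("spark.sql.streaming.multipleWatermarkPolicy", "max")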

setWatermark Method

setWatermark(newWatermarkMs: Long): Unit


setWatermark simply updates the global event-time watermark to the given newWatermarkMs .

Note: setWatermark is used exclusively when MicroBatchExecution is requested to populate start offsets (when requested to run an activated streaming query).

Updating Event-Time Watermark —  updateWatermark Method

updateWatermark(executedPlan: SparkPlan): Unit

updateWatermark requests the given physical operator ( SparkPlan ) to collect all EventTimeWatermarkExec unary physical operators.

updateWatermark simply exits when no EventTimeWatermarkExec was found.

updateWatermark …​FIXME

Note: updateWatermark is used exclusively when MicroBatchExecution is requested to run a single streaming batch (when requested to run an activated streaming query).

Internal Properties

globalWatermarkMs
Current global event-time watermark per MultipleWatermarkPolicy (across all EventTimeWatermarkExec operators in a physical query plan)
Default: 0
Used when…​FIXME

operatorToWatermarkMap
Event-time watermark per EventTimeWatermarkExec physical operator ( mutable.HashMap[Int, Long] )
Used when…​FIXME


Source Contract — Streaming Sources for Micro-Batch Stream Processing (Data Source API V1)

Source is the extension of the BaseStreamingSource contract for streaming sources that work with a "continuous" stream of data identified by offsets.

Source is part of Data Source API V1 and used in Micro-Batch Stream Processing only.

For fault tolerance, Source must be able to replay an arbitrary sequence of past data in a stream using a range of offsets. This assumption is what lets Structured Streaming achieve end-to-end exactly-once guarantees.
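
The replayability requirement can be illustrated with a plain-Scala sketch (ToyOffset and the in-memory log are made up for illustration; a real Source returns streaming DataFrames and real Offsets):

case class ToyOffset(value: Long)

// An append-only record log kept by the (toy) source
val log = Vector("a", "b", "c", "d", "e")

// Given the same (start, end] offset range, the same records must always come back,
// so a failed batch can be replayed with identical results
def getBatch(start: Option[ToyOffset], end: ToyOffset): Seq[String] = {
  val from = start.map(_.value).getOrElse(0L).toInt
  log.slice(from, end.value.toInt)
}

assert(getBatch(None, ToyOffset(2)) == Seq("a", "b"))
assert(getBatch(Some(ToyOffset(2)), ToyOffset(4)) == Seq("c", "d"))
assert(getBatch(Some(ToyOffset(2)), ToyOffset(4)) == getBatch(Some(ToyOffset(2)), ToyOffset(4)))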

Table 1. Source Contract

commit
commit(end: Offset): Unit
Commits data up to the end offset, i.e. informs the source that Spark has completed processing all data for offsets less than or equal to the end offset and will only request offsets greater than the end offset in the future.
Used exclusively when MicroBatchExecution stream execution engine (Micro-Batch Stream Processing) is requested to write offsets to a commit log (walCommit phase) while running an activated streaming query.

getBatch
getBatch(
  start: Option[Offset],
  end: Offset): DataFrame
Generates a streaming DataFrame with data between the start and end offsets
The start offset can be undefined ( None ) to indicate that the batch should begin with the first record
Used when MicroBatchExecution stream execution engine (Micro-Batch Stream Processing) is requested to run an activated streaming query, namely:
- Populate start offsets from checkpoint (resuming from checkpoint)
- Request unprocessed data from all sources (getBatch phase)

getOffset
getOffset: Option[Offset]
Latest (maximum) offset of the source (or None to denote no data)
Used exclusively when MicroBatchExecution stream execution engine (Micro-Batch Stream Processing) is requested for the latest offsets of all sources (getOffset phase) while running an activated streaming query.

schema
schema: StructType
Schema of the source
Note: schema seems to be used for tests only and is a duplication of StreamSourceProvider.sourceSchema.

Table 2. Sources

Source | Description
FileStreamSource | Part of file-based data sources ( FileFormat )
KafkaSource | Part of kafka data source
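
A skeleton of a hypothetical custom Source is sketched below. It is only an illustration of the contract: CounterSource and its field names are made up, and getBatch is left unimplemented because producing a DataFrame with isStreaming=true requires internal APIs.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.streaming.{LongOffset, Offset, Source}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// A hypothetical CounterSource that pretends to produce ever-increasing longs
class CounterSource extends Source {
  private var latest = 0L

  override def schema: StructType = StructType(StructField("value", LongType) :: Nil)

  // Latest (maximum) offset, or None when no data has been produced yet
  override def getOffset: Option[Offset] =
    if (latest == 0) None else Some(LongOffset(latest))

  // Must return a streaming DataFrame for the (start, end] offset range
  override def getBatch(start: Option[Offset], end: Offset): DataFrame = ???

  // Forget data up to and including the end offset
  override def commit(end: Offset): Unit = ()

  override def stop(): Unit = ()
}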


StreamSourceProvider Contract — Streaming Source Providers for Micro-Batch Stream Processing (Data Source API V1)

StreamSourceProvider is the contract of data source providers that can create a streaming source for a format (e.g. text file) or system (e.g. Apache Kafka).

StreamSourceProvider is part of Data Source API V1 and used in Micro-Batch Stream Processing only.

Table 1. StreamSourceProvider Contract

createSource
createSource(
  sqlContext: SQLContext,
  metadataPath: String,
  schema: Option[StructType],
  providerName: String,
  parameters: Map[String, String]): Source
Creates a streaming source
Note: metadataPath is the value of the optional user-specified checkpointLocation option or resolved by StreamingQueryManager.
Used exclusively when DataSource is requested to create a streaming source (when MicroBatchExecution is requested to initialize the analyzed logical plan)

sourceSchema
sourceSchema(
  sqlContext: SQLContext,
  schema: Option[StructType],
  providerName: String,
  parameters: Map[String, String]): (String, StructType)
The name and schema of the streaming source
Used exclusively when DataSource is requested for metadata of a streaming source (when MicroBatchExecution is requested to initialize the analyzed logical plan)

Note: KafkaSourceProvider is the only known StreamSourceProvider in Spark Structured Streaming.
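
A skeleton of a hypothetical StreamSourceProvider (a sketch; the class name and schema are made up, and createSource would normally return a custom Source implementation):

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.execution.streaming.Source
import org.apache.spark.sql.sources.StreamSourceProvider
import org.apache.spark.sql.types.{LongType, StructField, StructType}

class CounterSourceProvider extends StreamSourceProvider {
  private val counterSchema = StructType(StructField("value", LongType) :: Nil)

  override def sourceSchema(
      sqlContext: SQLContext,
      schema: Option[StructType],
      providerName: String,
      parameters: Map[String, String]): (String, StructType) =
    (providerName, schema.getOrElse(counterSchema))

  override def createSource(
      sqlContext: SQLContext,
      metadataPath: String,
      schema: Option[StructType],
      providerName: String,
      parameters: Map[String, String]): Source =
    ??? // return a custom Source here

}

To make such a provider selectable with .format(...), it would additionally have to implement DataSourceRegister (shortName) or be referenced by its fully-qualified class name.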


Sink Contract — Streaming Sinks for Micro-Batch Stream Processing

Sink is the extension of the BaseStreamingSink contract for streaming sinks that can add batches to an output.

Sink is part of Data Source API V1 and used in Micro-Batch Stream Processing only.

Table 1. Sink Contract

addBatch
addBatch(
  batchId: Long,
  data: DataFrame): Unit
Adds a batch of data to the sink
Used exclusively when MicroBatchExecution stream execution engine (Micro-Batch Stream Processing) is requested to add a streaming batch to a sink (addBatch phase) while running an activated streaming query.

Table 2. Sinks

Sink | Description
FileStreamSink | Used in file-based data sources ( FileFormat )
ForeachBatchSink | Used for DataStreamWriter.foreachBatch streaming operator
KafkaSink | Used for kafka output format
MemorySink | Used for memory output format
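
A minimal custom Sink sketch (the DebugSink name is made up; a production sink should typically make addBatch idempotent per batchId to preserve exactly-once semantics on retries):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.streaming.Sink

class DebugSink extends Sink {
  override def addBatch(batchId: Long, data: DataFrame): Unit = {
    // The incoming DataFrame is the result of the current micro-batch
    println(s"Batch $batchId: ${data.count()} row(s)")
  }
}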


StreamSinkProvider Contract
StreamSinkProvider is the abstraction of providers that can create a streaming sink for a file format (e.g. parquet ) or system (e.g. kafka ).

Important: StreamWriteSupport is a newer version of StreamSinkProvider (aka DataSource API V2 ) and new data sources should use that contract instead.

Table 1. StreamSinkProvider Contract

createSink
createSink(
  sqlContext: SQLContext,
  parameters: Map[String, String],
  partitionColumns: Seq[String],
  outputMode: OutputMode): Sink
Creates a streaming sink
Used exclusively when DataSource is requested for a streaming sink (when DataStreamWriter is requested to start a streaming query)

Note: KafkaSourceProvider is the only known StreamSinkProvider in Spark Structured Streaming.
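
A sketch wiring a Sink into a StreamSinkProvider (the class names are made up; a DataSourceRegister shortName could be added so the sink can be selected with .format(...), otherwise the fully-qualified class name is used):

import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.execution.streaming.Sink
import org.apache.spark.sql.sources.StreamSinkProvider
import org.apache.spark.sql.streaming.OutputMode

class DebugSinkProvider extends StreamSinkProvider {
  override def createSink(
      sqlContext: SQLContext,
      parameters: Map[String, String],
      partitionColumns: Seq[String],
      outputMode: OutputMode): Sink =
    new Sink {
      override def addBatch(batchId: Long, data: DataFrame): Unit =
        println(s"[$outputMode] batch $batchId: ${data.count()} row(s)")
    }
}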


Continuous Stream Processing (Structured Streaming V2)

Continuous Stream Processing is one of the two stream processing engines in Spark
Structured Streaming that is used for execution of structured streaming queries with
Trigger.Continuous trigger.

Note: The other feature-richer stream processing engine is Micro-Batch Stream Processing.

Continuous Stream Processing execution engine uses the novel Data Source API V2
(Spark SQL) and for the very first time makes stream processing truly continuous.

Tip Read up on Data Source API V2 in The Internals of Spark SQL book.

Because of the two innovative changes Continuous Stream Processing is often referred to
as Structured Streaming V2.

import org.apache.spark.sql.streaming.Trigger
import scala.concurrent.duration._
val sq = spark
.readStream
.format("rate")
.load
.writeStream
.format("console")
.option("truncate", false)
.trigger(Trigger.Continuous(15.seconds)) // <-- Uses ContinuousExecution for execution
.queryName("rate2console")
.start

scala> :type sq
org.apache.spark.sql.streaming.StreamingQuery

assert(sq.isActive)

// sq.stop

Under the covers, Continuous Stream Processing uses ContinuousExecution stream execution engine. When requested to run an activated streaming query, ContinuousExecution adds WriteToContinuousDataSourceExec physical operator as the top-level operator in the physical query plan of the streaming query.


scala> :type sq
org.apache.spark.sql.streaming.StreamingQuery

scala> sq.explain
== Physical Plan ==
WriteToContinuousDataSource ConsoleWriter[numRows=20, truncate=false]
+- *(1) Project [timestamp#758, value#759L]
+- *(1) ScanV2 rate[timestamp#758, value#759L]

From now on, you may think of a streaming query as a soon-to-be-generated ContinuousWriteRDD - an RDD data structure that Spark developers use to describe a distributed computation.

When the streaming query is started (and the top-level WriteToContinuousDataSourceExec physical operator is requested to execute and generate a recipe for a distributed computation (as an RDD[InternalRow])), it simply requests the underlying ContinuousWriteRDD to collect.

That collect operator is how a Spark job is run (as tasks over all partitions of the RDD) as
described by the ContinuousWriteRDD.compute "protocol" (a recipe for the tasks to be
scheduled to run on Spark executors).

Figure 1. Creating Instance of StreamExecution


While the tasks are computing partitions (of the ContinuousWriteRDD ), they keep running
until killed or completed. And that’s the ingenious design trick of how the streaming query
(as a Spark job with the distributed tasks running on executors) runs continuously and
indefinitely.

When DataStreamReader is requested to create a streaming query for a ContinuousReadSupport data source, it creates…​FIXME


ContinuousExecution — Stream Execution Engine of Continuous Stream Processing

ContinuousExecution is the stream execution engine of Continuous Stream Processing.

ContinuousExecution is created when StreamingQueryManager is requested to create a streaming query with a StreamWriteSupport sink and a ContinuousTrigger (when DataStreamWriter is requested to start an execution of the streaming query).

ContinuousExecution can only run streaming queries with StreamingRelationV2 with ContinuousReadSupport data source.

ContinuousExecution supports one ContinuousReader only in a streaming query (and asserts it when addOffset and committing an epoch). When requested for available streaming sources, ContinuousExecution simply gives the single ContinuousReader.

import org.apache.spark.sql.streaming.Trigger
import scala.concurrent.duration._
val sq = spark
.readStream
.format("rate")
.load
.writeStream
.format("console")
.option("truncate", false)
.trigger(Trigger.Continuous(1.minute)) // <-- Gives ContinuousExecution
.queryName("rate2console")
.start

import org.apache.spark.sql.streaming.StreamingQuery
assert(sq.isInstanceOf[StreamingQuery])

// The following gives access to the internals
// And to ContinuousExecution
import org.apache.spark.sql.execution.streaming.StreamingQueryWrapper
val engine = sq.asInstanceOf[StreamingQueryWrapper].streamingQuery
import org.apache.spark.sql.execution.streaming.StreamExecution
assert(engine.isInstanceOf[StreamExecution])

import org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution
val continuousEngine = engine.asInstanceOf[ContinuousExecution]
assert(continuousEngine.trigger == Trigger.Continuous(1.minute))


When created (for a streaming query), ContinuousExecution is given the analyzed logical
plan. The analyzed logical plan is immediately transformed to include a
ContinuousExecutionRelation for every StreamingRelationV2 with ContinuousReadSupport
data source (and is the logical plan internally).

Note: ContinuousExecution uses the same instance of ContinuousExecutionRelation for the same instances of StreamingRelationV2 with ContinuousReadSupport data source.

When requested to run the streaming query, ContinuousExecution collects ContinuousReadSupport data sources (inside ContinuousExecutionRelation) from the analyzed logical plan and requests each and every ContinuousReadSupport to create a ContinuousReader (that are stored in continuousSources internal registry).

ContinuousExecution uses __epoch_coordinator_id local property for…​FIXME

ContinuousExecution uses __continuous_start_epoch local property for…​FIXME

ContinuousExecution uses __continuous_epoch_interval local property for…​FIXME

Tip: Enable ALL logging level for org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution to see what happens inside.

Add the following line to conf/log4j.properties :

log4j.logger.org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution=ALL

Refer to Logging.

Running Activated Streaming Query —  runActivatedStream Method

runActivatedStream(sparkSessionForStream: SparkSession): Unit

Note: runActivatedStream is part of the StreamExecution Contract to run a streaming query.

runActivatedStream simply runs the streaming query in continuous mode as long as the state is ACTIVE.

Running Streaming Query in Continuous Mode —  runContinuous Internal Method


runContinuous(sparkSessionForQuery: SparkSession): Unit

runContinuous initializes the continuousSources internal registry by traversing the analyzed logical plan to find ContinuousExecutionRelation leaf logical operators and requests their ContinuousReadSupport data sources to create a ContinuousReader (with the sources metadata directory under the checkpoint directory).

runContinuous initializes the uniqueSources internal registry to be the continuousSources distinct.

runContinuous gets the start offsets (they may or may not be available).

runContinuous transforms the analyzed logical plan. For every ContinuousExecutionRelation runContinuous finds the corresponding ContinuousReader (in the continuousSources), requests it to deserialize the start offsets (from their JSON representation), and then setStartOffset. In the end, runContinuous creates a StreamingDataSourceV2Relation (with the read schema of the ContinuousReader and the ContinuousReader itself).

runContinuous rewires the transformed plan (with the StreamingDataSourceV2Relation ) to use the new attributes from the source (the reader).

Note: CurrentTimestamp and CurrentDate expressions are not supported for continuous processing.

runContinuous requests the StreamWriteSupport to create a StreamWriter (with the run ID of the streaming query).

runContinuous creates a WriteToContinuousDataSource (with the StreamWriter and the transformed logical query plan).

runContinuous finds the only ContinuousReader (of the only StreamingDataSourceV2Relation ) in the query plan with the WriteToContinuousDataSource .

In queryPlanning time-tracking section, runContinuous creates an IncrementalExecution (that becomes the lastExecution) that is immediately executed (i.e. the entire query execution pipeline is executed up to and including executedPlan).

runContinuous sets the following local properties:

__is_continuous_processing as true

__continuous_start_epoch as the currentBatchId

__epoch_coordinator_id as the currentEpochCoordinatorId, i.e. runId followed by -- with a random UUID

__continuous_epoch_interval as the interval of the ContinuousTrigger

runContinuous uses the EpochCoordinatorRef helper to create a remote reference to the EpochCoordinator RPC endpoint (with the StreamWriter, the ContinuousReader, the currentEpochCoordinatorId, and the currentBatchId).

Note: The EpochCoordinator RPC endpoint runs on the driver as the single point to coordinate epochs across partition tasks.

runContinuous creates a daemon epoch update thread and starts it immediately.

In runContinuous time-tracking section, runContinuous requests the physical query plan (of the IncrementalExecution) to execute (that simply requests the physical operator to doExecute and generate an RDD[InternalRow] ).

Note: runContinuous is used exclusively when ContinuousExecution is requested to run an activated streaming query.

Epoch Update Thread


runContinuous creates an epoch update thread that…​FIXME

Getting Start Offsets From Checkpoint —  getStartOffsets Internal Method

getStartOffsets(sparkSessionToRunBatches: SparkSession): OffsetSeq

getStartOffsets …​FIXME

Note: getStartOffsets is used exclusively when ContinuousExecution is requested to run a streaming query in continuous mode.

Committing Epoch —  commit Method

commit(epoch: Long): Unit

In essence, commit adds the given epoch to commit log and the committedOffsets, and
requests the ContinuousReader to commit the corresponding offset. In the end, commit
removes old log entries from the offset and commit logs (to keep
spark.sql.streaming.minBatchesToRetain entries only).


Internally, commit recordTriggerOffsets (with the from and to offsets as the committedOffsets and availableOffsets, respectively).

At this point, commit may simply return when the stream execution thread is no longer alive
(died).

commit requests the commit log to store a metadata for the epoch.

commit requests the single ContinuousReader to deserialize the offset for the epoch (from

the offset write-ahead log).

commit adds the single ContinuousReader and the offset (for the epoch) to the

committedOffsets registry.

commit requests the single ContinuousReader to commit the offset.

commit requests the offset and commit logs to remove log entries to keep

spark.sql.streaming.minBatchesToRetain only.

commit then acquires the awaitProgressLock, wakes up all threads waiting for the

awaitProgressLockCondition and in the end releases the awaitProgressLock.

Note: commit supports only one continuous source (registered in the continuousSources internal registry).

commit asserts that the given epoch is available in the offsetLog internal registry (i.e. the

offset for the given epoch has been reported before).

Note: commit is used exclusively when EpochCoordinator is requested to commitEpoch.

addOffset Method

addOffset(
epoch: Long,
reader: ContinuousReader,
partitionOffsets: Seq[PartitionOffset]): Unit

In essence, addOffset requests the given ContinuousReader to mergeOffsets (with the given PartitionOffsets ) and then requests the OffsetSeqLog to register the offset with the given epoch.


Figure 1. ContinuousExecution.addOffset
Internally, addOffset requests the given ContinuousReader to mergeOffsets (with the given
PartitionOffsets ) and to get the current "global" offset back.

addOffset then requests the OffsetSeqLog to add the current "global" offset for the given

epoch .

addOffset requests the OffsetSeqLog for the offset at the previous epoch.

If the offsets at the current and previous epochs are the same, addOffset turns the
noNewData internal flag on.

addOffset then acquires the awaitProgressLock, wakes up all threads waiting for the

awaitProgressLockCondition and in the end releases the awaitProgressLock.

Note addOffset supports exactly one continuous source.

Note: addOffset is used exclusively when EpochCoordinator is requested to handle a ReportPartitionOffset message.

Analyzed Logical Plan of Streaming Query —  logicalPlan Property

logicalPlan: LogicalPlan

Note: logicalPlan is part of the StreamExecution Contract that is the analyzed logical plan of the streaming query.

logicalPlan resolves StreamingRelationV2 leaf logical operators (with a ContinuousReadSupport source) to ContinuousExecutionRelation leaf logical operators.

Internally, logicalPlan transforms the analyzed logical plan as follows:


1. For every StreamingRelationV2 leaf logical operator with a ContinuousReadSupport source, logicalPlan looks it up for the corresponding ContinuousExecutionRelation (if available in the internal lookup registry) or creates a ContinuousExecutionRelation (with the ContinuousReadSupport source, the options and the output attributes of the StreamingRelationV2 operator)

2. For any other StreamingRelationV2 , logicalPlan throws an UnsupportedOperationException :

Data source [name] does not support continuous processing.

Creating ContinuousExecution Instance


ContinuousExecution takes the following when created:

SparkSession

The name of the structured query

Path to the checkpoint directory (aka metadata directory)

Analyzed logical query plan ( LogicalPlan )

StreamWriteSupport

Trigger

Clock

Output mode

Options ( Map[String, String] )

deleteCheckpointOnStop flag to control whether to delete the checkpoint directory on stop

ContinuousExecution initializes the internal properties.

Stopping Stream Processing (Execution of Streaming Query) —  stop Method

stop(): Unit

Note stop is part of the StreamingQuery Contract to stop a streaming query.


stop transitions the streaming query to TERMINATED state.

If the queryExecutionThread is alive (i.e. it has been started and has not yet died), stop
interrupts it and waits for this thread to die.

In the end, stop prints out the following INFO message to the logs:

Query [prettyIdString] was stopped

Note prettyIdString is in the format of queryName [id = [id], runId = [runId]] .

awaitEpoch Internal Method

awaitEpoch(epoch: Long): Unit

awaitEpoch …​FIXME

Note awaitEpoch seems to be used exclusively in tests.

Internal Properties

continuousSources
continuousSources: Seq[ContinuousReader]
Registry of ContinuousReaders (in the analyzed logical plan of the streaming query)
As asserted in commit and addOffset there can only be exactly one ContinuousReader registered.
Used when ContinuousExecution is requested to commit, getStartOffsets, and runContinuous
Use sources to access the current value

currentEpochCoordinatorId
FIXME
Used when…​FIXME

triggerExecutor
TriggerExecutor for the Trigger: ProcessingTimeExecutor for ContinuousTrigger
Used when…​FIXME
Note: StreamExecution throws an IllegalStateException when the Trigger is not a ContinuousTrigger.


ContinuousReadSupport Contract — Data Sources with ContinuousReaders
ContinuousReadSupport is the extension of the DataSourceV2 for data sources with a

ContinuousReader for Continuous Stream Processing.

ContinuousReadSupport defines a single createContinuousReader method to create a

ContinuousReader.

ContinuousReader createContinuousReader(
Optional<StructType> schema,
String checkpointLocation,
DataSourceOptions options)

createContinuousReader is used when:

ContinuousExecution is requested to run a streaming query (and finds

ContinuousExecutionRelations in the analyzed logical plan)

DataStreamReader is requested to create a streaming query for a

ContinuousReadSupport data source

Table 1. ContinuousReadSupports
ContinuousReadSupport Description
ContinuousMemoryStream Data source provider for memory format

KafkaSourceProvider Data source provider for kafka format

RateStreamProvider Data source provider for rate format

TextSocketSourceProvider Data source provider for socket format
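As an illustration, any of the data sources above can back a query that runs with a continuous trigger (a minimal sketch using the built-in rate format and the console sink; assumes an active SparkSession named spark):

import org.apache.spark.sql.streaming.Trigger

val sq = spark
  .readStream
  .format("rate")    // RateStreamProvider is a ContinuousReadSupport
  .load
  .writeStream
  .format("console")
  .trigger(Trigger.Continuous("1 second")) // checkpoint the offsets every second
  .start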


ContinuousReader Contract — Data Source Readers in Continuous Stream Processing
ContinuousReader is the extension of Spark SQL’s DataSourceReader (and

BaseStreamingSource) contracts for data source readers in Continuous Stream Processing.

ContinuousReader is part of the novel Data Source API V2 in Spark SQL.

Tip Read up on Data Source API V2 in The Internals of Spark SQL book.


Table 1. ContinuousReader Contract


Method Description

void commit(Offset end)

commit
Commits the specified offset
Used exclusively when ContinuousExecution is requested to
commit an epoch

Offset deserializeOffset(String json)

deserializeOffset
Deserializes an offset from JSON representation
Used when ContinuousExecution is requested to run a
streaming query and commit an epoch

Offset getStartOffset()

getStartOffset

Note Used exclusively in tests.

Offset mergeOffsets(PartitionOffset[] offsets)

mergeOffsets
Used exclusively when ContinuousExecution is requested to
addOffset

boolean needsReconfiguration()

needsReconfiguration Indicates that the reader needs reconfiguration (e.g. to


generate new input partitions)

Used exclusively when ContinuousExecution is requested to


run a streaming query in continuous mode

void setStartOffset(Optional<Offset> start)

setStartOffset
Used exclusively when ContinuousExecution is requested to
run the streaming query in continuous mode.


Table 2. ContinuousReaders
ContinuousReader Description

ContinuousMemoryStream

KafkaContinuousReader

RateStreamContinuousReader

TextSocketContinuousReader


RateStreamContinuousReader
RateStreamContinuousReader is a ContinuousReader that…​FIXME


EpochCoordinator RPC Endpoint —  Coordinating Epochs and Offsets Across Partition Tasks
EpochCoordinator is a ThreadSafeRpcEndpoint that tracks offsets and epochs (coordinates

epochs) by handling messages (in fire-and-forget one-way and request-response two-way


modes) from…​FIXME

EpochCoordinator is created (using create factory method) when ContinuousExecution is

requested to run a streaming query in continuous mode.


Table 1. EpochCoordinator RPC Endpoint’s Messages

CommitPartitionEpoch (Partition ID, Epoch, Data Source API V2’s WriterCommitMessage)
  Sent out (in one-way asynchronous mode) exclusively when ContinuousWriteRDD is requested to compute a partition (after all rows were written down to a streaming sink)

GetCurrentEpoch
  Sent out (in request-response synchronous mode) exclusively when EpochMarkerGenerator thread is requested to run

IncrementAndGetEpoch
  Sent out (in request-response synchronous mode) exclusively when ContinuousExecution is requested to run a streaming query in continuous mode (and start a separate epoch update thread)

ReportPartitionOffset (Partition ID, Epoch, PartitionOffset)
  Sent out (in one-way asynchronous mode) exclusively when ContinuousQueuedDataReader is requested for the next row to be read in the current epoch, and the epoch is done

SetReaderPartitions (Number of partitions)
  Sent out (in request-response synchronous mode) exclusively when DataSourceV2ScanExec leaf physical operator is requested for the input RDDs (for a ContinuousReader and is about to create a ContinuousDataSourceRDD)
  The number of partitions is exactly the number of InputPartitions from the ContinuousReader .

SetWriterPartitions (Number of partitions)
  Sent out (in request-response synchronous mode) exclusively when WriteToContinuousDataSourceExec leaf physical operator is requested to execute and generate a recipe for a distributed computation (as an RDD[InternalRow]) (and requests a ContinuousWriteRDD to collect that simply never finishes…​and that’s the trick of continuous mode)

StopContinuousExecutionWrites
  Sent out (in request-response synchronous mode) exclusively when ContinuousExecution is requested to run a streaming query in continuous mode (and it finishes successfully or not)


Tip Enable ALL logging level for org.apache.spark.sql.execution.streaming.continuous.EpochCoordinatorRef* logger to see what happens inside.

Add the following line to conf/log4j.properties :

log4j.logger.org.apache.spark.sql.execution.streaming.continuous.EpochCoordinatorRef*=ALL

Refer to Logging.

Receiving Messages (Fire-And-Forget One-Way Mode) —  receive Method

receive: PartialFunction[Any, Unit]

Note receive is part of the RpcEndpoint Contract in Apache Spark to receive messages in fire-and-forget one-way mode.

receive handles the following messages:

CommitPartitionEpoch

ReportPartitionOffset

With the queryWritesStopped turned on, receive simply swallows messages and does
nothing.

Receiving Messages (Request-Response Two-Way Mode) —  receiveAndReply Method

receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit]

Note receiveAndReply is part of the RpcEndpoint Contract in Apache Spark to receive and reply to messages in request-response two-way mode.

receiveAndReply handles the following messages:

GetCurrentEpoch

IncrementAndGetEpoch

SetReaderPartitions


SetWriterPartitions

StopContinuousExecutionWrites

resolveCommitsAtEpoch Internal Method

resolveCommitsAtEpoch(epoch: Long): Unit

resolveCommitsAtEpoch …​FIXME

Note resolveCommitsAtEpoch is used exclusively when EpochCoordinator is requested to handle CommitPartitionEpoch and ReportPartitionOffset messages.

commitEpoch Internal Method

commitEpoch(
epoch: Long,
messages: Iterable[WriterCommitMessage]): Unit

commitEpoch …​FIXME

Note commitEpoch is used exclusively when EpochCoordinator is requested to resolveCommitsAtEpoch.

Creating EpochCoordinator Instance


EpochCoordinator takes the following to be created:

StreamWriter

ContinuousReader

ContinuousExecution

Start epoch

SparkSession

RpcEnv

EpochCoordinator initializes the internal properties.


Registering EpochCoordinator RPC Endpoint —  create Factory Method

create(
writer: StreamWriter,
reader: ContinuousReader,
query: ContinuousExecution,
epochCoordinatorId: String,
startEpoch: Long,
session: SparkSession,
env: SparkEnv): RpcEndpointRef

create simply creates a new EpochCoordinator and requests the RpcEnv to register a

RPC endpoint as EpochCoordinator-[id] (where id is the given epochCoordinatorId ).

create prints out the following INFO message to the logs:

Registered EpochCoordinator endpoint

Note create is used exclusively when ContinuousExecution is requested to run a streaming query in continuous mode.

Internal Properties

queryWritesStopped
  Flag that indicates whether to drop messages ( true ) or not ( false ) when requested to handle one synchronously
  Default: false
  Turned on ( true ) when requested to handle a synchronous StopContinuousExecutionWrites message


EpochCoordinatorRef
EpochCoordinatorRef is…​FIXME

Creating Remote Reference to EpochCoordinator RPC Endpoint —  create Factory Method

create(
writer: StreamWriter,
reader: ContinuousReader,
query: ContinuousExecution,
epochCoordinatorId: String,
startEpoch: Long,
session: SparkSession,
env: SparkEnv): RpcEndpointRef

create …​FIXME

Note create is used exclusively when ContinuousExecution is requested to run a streaming query in continuous mode.

Getting Remote Reference to EpochCoordinator RPC Endpoint —  get Factory Method

get(id: String, env: SparkEnv): RpcEndpointRef

get …​FIXME

Note get is used when:

  DataSourceV2ScanExec leaf physical operator is requested for the input RDDs (and creates a ContinuousDataSourceRDD for a ContinuousReader)

  ContinuousQueuedDataReader is created (and initializes the epochCoordEndpoint)

  EpochMarkerGenerator is created (and initializes the epochCoordEndpoint)

  ContinuousWriteRDD is requested to compute a partition

  WriteToContinuousDataSourceExec is requested to execute and generate a recipe for a distributed computation (as an RDD[InternalRow])


EpochTracker
EpochTracker is…​FIXME

Current Epoch —  getCurrentEpoch Method

getCurrentEpoch: Option[Long]

getCurrentEpoch …​FIXME

Note getCurrentEpoch is used when…​FIXME

Advancing (Incrementing) Epoch 


—  incrementCurrentEpoch Method

incrementCurrentEpoch(): Unit

incrementCurrentEpoch …​FIXME

Note incrementCurrentEpoch is used when…​FIXME
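Until the internals are filled in, the following self-contained sketch shows the general idea of an epoch counter with get/increment semantics (a conceptual illustration only, not Spark's EpochTracker, which is per-task):

import java.util.concurrent.atomic.AtomicLong

// Conceptual epoch counter with the same get/increment flavour
object SimpleEpochTracker {
  private val epoch = new AtomicLong(-1L) // -1 means "not initialized yet"

  def initializeCurrentEpoch(startEpoch: Long): Unit = epoch.set(startEpoch)

  def getCurrentEpoch: Option[Long] = {
    val e = epoch.get()
    if (e < 0) None else Some(e)
  }

  def incrementCurrentEpoch(): Unit = epoch.incrementAndGet()
}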


ContinuousQueuedDataReader
ContinuousQueuedDataReader is created exclusively when ContinuousDataSourceRDD is

requested to compute a partition.

ContinuousQueuedDataReader uses two types of continuous records:

EpochMarker

ContinuousRow (with the InternalRow at PartitionOffset )

Fetching Next Row —  next Method

next(): InternalRow

next …​FIXME

Note next is used when…​FIXME
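The two record types above suggest how the reader drains its queue. The sketch below is a conceptual, self-contained model of that loop (plain Scala types, not Spark's ContinuousQueuedDataReader):

import java.util.concurrent.ArrayBlockingQueue

// Conceptual model of the reader's queue: a record is either a data row or an end-of-epoch marker
sealed trait ContinuousRecord
final case class Row(value: String) extends ContinuousRecord
case object EpochMarker extends ContinuousRecord

object QueueDrainDemo extends App {
  val queue = new ArrayBlockingQueue[ContinuousRecord](16)
  queue.put(Row("a")); queue.put(EpochMarker); queue.put(Row("b"))

  // A next()-like step: rows go to the caller, an epoch marker signals "current epoch is done"
  def next(): Option[String] = queue.take() match {
    case Row(v)      => Some(v)
    case EpochMarker => None // the real reader reports the partition offset and returns null here
  }

  println(next()) // Some(a)
  println(next()) // None -> epoch boundary
  println(next()) // Some(b)
}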

Closing ContinuousQueuedDataReader —  close Method

close(): Unit

Note close is part of the java.io.Closeable contract to close this stream and release any system resources associated with it.

close …​FIXME

Creating ContinuousQueuedDataReader Instance


ContinuousQueuedDataReader takes the following to be created:

ContinuousDataSourceRDDPartition

TaskContext

Size of the data queue

epochPollIntervalMs

ContinuousQueuedDataReader initializes the internal properties.


Internal Properties

coordinatorId
  Epoch Coordinator Identifier
  Used when…​FIXME

currentOffset
  PartitionOffset
  Used when…​FIXME

dataReaderThread
  DataReaderThread daemon thread that is created and started immediately when ContinuousQueuedDataReader is created
  Used when…​FIXME

epochCoordEndpoint
  RpcEndpointRef of the EpochCoordinator per coordinatorId
  Used when…​FIXME

epochMarkerExecutor
  java.util.concurrent.ScheduledExecutorService
  Used when…​FIXME

epochMarkerGenerator
  EpochMarkerGenerator
  Used when…​FIXME

reader
  InputPartitionReader
  Used when…​FIXME

queue
  java.util.concurrent.ArrayBlockingQueue of ContinuousRecords (of the given data size)
  Used when…​FIXME


DataReaderThread
DataReaderThread is…​FIXME


EpochMarkerGenerator Thread
EpochMarkerGenerator is…​FIXME

run Method

run(): Unit

Note run is part of the java.lang.Runnable Contract to be executed upon starting a thread.

run …​FIXME


PartitionOffset
PartitionOffset is…​FIXME


ContinuousExecutionRelation Leaf Logical Operator
ContinuousExecutionRelation is a MultiInstanceRelation leaf logical operator.

Tip Read up on Leaf Logical Operators in The Internals of Spark SQL book.

ContinuousExecutionRelation is created (to represent StreamingRelationV2 with

ContinuousReadSupport data source) when ContinuousExecution is created (and requested


for the logical plan).

ContinuousExecutionRelation takes the following to be created:

ContinuousReadSupport source

Options ( Map[String, String] )

Output attributes ( Seq[Attribute] )

SparkSession


WriteToContinuousDataSource Unary Logical Operator
WriteToContinuousDataSource is a unary logical operator ( LogicalPlan ) that is created

exclusively when ContinuousExecution is requested to run a streaming query in continuous


mode (to create an IncrementalExecution).

WriteToContinuousDataSource is planned (translated) to a

WriteToContinuousDataSourceExec unary physical operator (when DataSourceV2Strategy


execution planning strategy is requested to plan a logical query).

Tip Read up on DataSourceV2Strategy Execution Planning Strategy in The Internals of Spark SQL book.

WriteToContinuousDataSource takes the following to be created:

StreamWriter

Child logical operator ( LogicalPlan )

WriteToContinuousDataSource uses empty output schema (which is exactly to say that no

output is expected whatsoever).


WriteToContinuousDataSourceExec Unary
Physical Operator
WriteToContinuousDataSourceExec is a unary physical operator that creates a

ContinuousWriteRDD for continuous write.

Note A unary physical operator ( UnaryExecNode ) is a physical operator with a single child physical operator. Read up on UnaryExecNode (and physical operators in general) in The Internals of Spark SQL book.

WriteToContinuousDataSourceExec is created exclusively when DataSourceV2Strategy

execution planning strategy is requested to plan a WriteToContinuousDataSource unary


logical operator.

Tip Read up on DataSourceV2Strategy Execution Planning Strategy in The Internals of Spark SQL book.

WriteToContinuousDataSourceExec takes the following to be created:

StreamWriter

Child physical operator ( SparkPlan )

WriteToContinuousDataSourceExec uses empty output schema (which is exactly to say that no

output is expected whatsoever).

Tip Enable ALL logging level for org.apache.spark.sql.execution.streaming.continuous.WriteToContinuousDataSourceExec logger to see what happens inside.

Add the following line to conf/log4j.properties :

log4j.logger.org.apache.spark.sql.execution.streaming.continuous.WriteToContinuousDataSourceExec=ALL

Refer to Logging.

Executing Physical Operator (Generating RDD[InternalRow]) —  doExecute Method

doExecute(): RDD[InternalRow]


Note doExecute is part of the SparkPlan Contract to generate the runtime representation of a physical operator as a distributed computation over internal binary rows on Apache Spark (i.e. RDD[InternalRow] ).

doExecute requests the StreamWriter to create a DataWriterFactory .

doExecute then requests the child physical operator to execute (that gives a

RDD[InternalRow] ) and uses the RDD[InternalRow] and the DataWriterFactory to create a

ContinuousWriteRDD.

doExecute prints out the following INFO message to the logs:

Start processing data source writer: [writer]. The input RDD has [partitions] partitions.

doExecute requests the EpochCoordinatorRef helper for a remote reference to the

EpochCoordinator RPC endpoint (using the __epoch_coordinator_id local property).

Note The EpochCoordinator RPC endpoint runs on the driver as the single point to coordinate epochs across partition tasks.

doExecute requests the EpochCoordinator RPC endpoint reference to send out a

SetWriterPartitions message synchronously.

In the end, doExecute requests the ContinuousWriteRDD to collect (which simply runs a
Spark job on all partitions in an RDD and returns the results in an array).

Note Requesting the ContinuousWriteRDD to collect is how a Spark job is run that in turn runs tasks (one per partition) that are described by the ContinuousWriteRDD.compute method. Since executing collect is meant to run a Spark job (with tasks on executors), it is at the discretion of the tasks themselves to decide when to finish (so if they want to run indefinitely, so be it). What a clever trick!


ContinuousWriteRDD — RDD of
WriteToContinuousDataSourceExec Unary
Physical Operator
ContinuousWriteRDD is a specialized RDD ( RDD[Unit] ) that is used exclusively as the

underlying RDD of WriteToContinuousDataSourceExec unary physical operator to write


records continuously.

ContinuousWriteRDD is created exclusively when WriteToContinuousDataSourceExec unary

physical operator is requested to execute and generate a recipe for a distributed


computation (as an RDD[InternalRow]).

ContinuousWriteRDD uses the parent RDD for the partitions and the partitioner.

ContinuousWriteRDD takes the following to be created:

Parent RDD ( RDD[InternalRow] )

Write task ( DataWriterFactory[InternalRow] )

Computing Partition —  compute Method

compute(
split: Partition,
context: TaskContext): Iterator[Unit]

Note compute is part of the RDD Contract to compute a partition.

compute requests the EpochCoordinatorRef helper for a remote reference to the

EpochCoordinator RPC endpoint (using the __epoch_coordinator_id local property).

Note The EpochCoordinator RPC endpoint runs on the driver as the single point to coordinate epochs across partition tasks.

compute uses the EpochTracker helper to initializeCurrentEpoch (using the

__continuous_start_epoch local property).

compute then executes the following steps (in a loop) until the task (as the given

TaskContext ) is killed or completed.

compute requests the parent RDD to compute the given partition (that gives an

Iterator[InternalRow] ).


compute requests the DataWriterFactory to create a DataWriter (for the partition and the

task attempt IDs from the given TaskContext and the current epoch from the EpochTracker
helper) and requests it to write all records (from the Iterator[InternalRow] ).

compute prints out the following INFO message to the logs:

Writer for partition [partitionId] in epoch [epoch] is committing.

compute requests the DataWriter to commit (that gives a WriterCommitMessage ).

compute requests the EpochCoordinator RPC endpoint reference to send out a

CommitPartitionEpoch message (with the WriterCommitMessage ).

compute prints out the following INFO message to the logs:

Writer for partition [partitionId] in epoch [epoch] is committed.

In the end (of the loop), compute uses the EpochTracker helper to incrementCurrentEpoch.

In case of an error, compute prints out the following ERROR message to the logs and
requests the DataWriter to abort.

Writer for partition [partitionId] is aborting.

In the end, compute prints out the following ERROR message to the logs:

Writer for partition [partitionId] aborted.


ContinuousDataSourceRDD — Input RDD of
DataSourceV2ScanExec Physical Operator
with ContinuousReader
ContinuousDataSourceRDD is a specialized RDD ( RDD[InternalRow] ) that is used exclusively

for the only input RDD (with the input rows) of DataSourceV2ScanExec leaf physical operator
with a ContinuousReader.

ContinuousDataSourceRDD is created exclusively when DataSourceV2ScanExec leaf physical

operator is requested for the input RDDs (which there is only one actually).

ContinuousDataSourceRDD uses spark.sql.streaming.continuous.executorQueueSize

configuration property for the size of the data queue.

ContinuousDataSourceRDD uses spark.sql.streaming.continuous.executorPollIntervalMs

configuration property for the epochPollIntervalMs.
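Both properties can be set on the SparkSession before a continuous query is started (a sketch; the values below are arbitrary and assume an active SparkSession named spark):

spark.conf.set("spark.sql.streaming.continuous.executorQueueSize", "1024")     // size of the data queue
spark.conf.set("spark.sql.streaming.continuous.executorPollIntervalMs", "100") // epoch-poll interval (ms)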

ContinuousDataSourceRDD takes the following to be created:

SparkContext

Size of the data queue

epochPollIntervalMs

InputPartition[InternalRow] s

ContinuousDataSourceRDD uses InputPartition (of a ContinuousDataSourceRDDPartition ) for

preferred host locations (where the input partition reader can run faster).

Computing Partition —  compute Method

compute(
split: Partition,
context: TaskContext): Iterator[InternalRow]

Note compute is part of the RDD Contract to compute a given partition.

compute …​FIXME

getPartitions Method


getPartitions: Array[Partition]

Note getPartitions is part of the RDD Contract to specify the partitions to compute.

getPartitions …​FIXME


StreamExecution — Base of Stream Execution Engines
StreamExecution is the base of stream execution engines (aka streaming query processing

engines) that can run a structured query (on a stream execution thread).

Note Continuous query, streaming query, continuous Dataset, and streaming Dataset are all considered high-level synonyms for an executable entity that stream execution engines run using the analyzed logical plan internally.

Table 1. StreamExecution Contract (Abstract Methods Only)

logicalPlan

  logicalPlan: LogicalPlan

  Analyzed logical plan of the streaming query to execute
  Used when StreamExecution is requested to run stream processing
  Note logicalPlan is part of the ProgressReporter Contract and the only purpose of the logicalPlan property is to change the access level from protected to public .

runActivatedStream

  runActivatedStream(
    sparkSessionForStream: SparkSession): Unit

  Executes (runs) the activated streaming query
  Used exclusively when StreamExecution is requested to run the streaming query (when transitioning from INITIALIZING to ACTIVE state)

Streaming Query and Stream Execution Engine

import org.apache.spark.sql.streaming.StreamingQuery
assert(sq.isInstanceOf[StreamingQuery])

import org.apache.spark.sql.execution.streaming.StreamingQueryWrapper
val se = sq.asInstanceOf[StreamingQueryWrapper].streamingQuery

scala> :type se
org.apache.spark.sql.execution.streaming.StreamExecution


StreamExecution uses the spark.sql.streaming.minBatchesToRetain configuration property

to allow the StreamExecutions to discard old log entries (from the offset and commit logs).

Table 2. StreamExecutions
StreamExecution Description
ContinuousExecution Used in Continuous Stream Processing

MicroBatchExecution Used in Micro-Batch Stream Processing

Note StreamExecution does not support adaptive query execution and cost-based optimizer (and turns them off when requested to run stream processing).

StreamExecution is the execution environment of a single streaming query (aka streaming

Dataset) that is executed every trigger and in the end adds the results to a sink.

Note StreamExecution corresponds to a single streaming query with one or more streaming sources and exactly one streaming sink.

import org.apache.spark.sql.streaming.Trigger
import scala.concurrent.duration._
val q = spark.
readStream.
format("rate").
load.
writeStream.
format("console").
trigger(Trigger.ProcessingTime(10.minutes)).
start
scala> :type q
org.apache.spark.sql.streaming.StreamingQuery

// Pull out StreamExecution off StreamingQueryWrapper


import org.apache.spark.sql.execution.streaming.{StreamExecution, StreamingQueryWrapper
}
val se = q.asInstanceOf[StreamingQueryWrapper].streamingQuery
scala> :type se
org.apache.spark.sql.execution.streaming.StreamExecution


Figure 1. Creating Instance of StreamExecution


Note DataStreamWriter describes how the results of executing batches of a streaming query are written to a streaming sink.

When started, StreamExecution starts a stream execution thread that simply runs stream
processing (and hence the streaming query).

Figure 2. StreamExecution’s Starting Streaming Query (on Execution Thread)


StreamExecution is a ProgressReporter and reports status of the streaming query (i.e. when

it starts, progresses and terminates) by posting StreamingQueryListener events.


import org.apache.spark.sql.streaming.Trigger
import scala.concurrent.duration._
val sq = spark
.readStream
.text("server-logs")
.writeStream
.format("console")
.queryName("debug")
.trigger(Trigger.ProcessingTime(20.seconds))
.start

// Enable the log level to see the INFO and DEBUG messages
// log4j.logger.org.apache.spark.sql.execution.streaming.StreamExecution=DEBUG

17/06/18 21:21:07 INFO StreamExecution: Starting new streaming query.


17/06/18 21:21:07 DEBUG StreamExecution: getOffset took 5 ms
17/06/18 21:21:07 DEBUG StreamExecution: Stream running from {} to {}
17/06/18 21:21:07 DEBUG StreamExecution: triggerExecution took 9 ms
17/06/18 21:21:07 DEBUG StreamExecution: Execution stats: ExecutionStats(Map(),List(),
Map())
17/06/18 21:21:07 INFO StreamExecution: Streaming query made progress: {
"id" : "8b57b0bd-fc4a-42eb-81a3-777d7ba5e370",
"runId" : "920b227e-6d02-4a03-a271-c62120258cea",
"name" : "debug",
"timestamp" : "2017-06-18T19:21:07.693Z",
"numInputRows" : 0,
"processedRowsPerSecond" : 0.0,
"durationMs" : {
"getOffset" : 5,
"triggerExecution" : 9
},
"stateOperators" : [ ],
"sources" : [ {
"description" : "FileStreamSource[file:/Users/jacek/dev/oss/spark/server-logs]",
"startOffset" : null,
"endOffset" : null,
"numInputRows" : 0,
"processedRowsPerSecond" : 0.0
} ],
"sink" : {
"description" : "org.apache.spark.sql.execution.streaming.ConsoleSink@2460208a"
}
}
17/06/18 21:21:10 DEBUG StreamExecution: Starting Trigger Calculation
17/06/18 21:21:10 DEBUG StreamExecution: getOffset took 3 ms
17/06/18 21:21:10 DEBUG StreamExecution: triggerExecution took 3 ms
17/06/18 21:21:10 DEBUG StreamExecution: Execution stats: ExecutionStats(Map(),List(),
Map())

StreamExecution tracks streaming data sources in uniqueSources internal registry.


Figure 3. StreamExecution’s uniqueSources Registry of Streaming Data Sources


StreamExecution collects durationMs for the execution units of streaming batches.

Figure 4. StreamExecution’s durationMs


scala> :type q
org.apache.spark.sql.streaming.StreamingQuery

scala> println(q.lastProgress)
{
"id" : "03fc78fc-fe19-408c-a1ae-812d0e28fcee",
"runId" : "8c247071-afba-40e5-aad2-0e6f45f22488",
"name" : null,
"timestamp" : "2017-08-14T20:30:00.004Z",
"batchId" : 1,
"numInputRows" : 432,
"inputRowsPerSecond" : 0.9993568953312452,
"processedRowsPerSecond" : 1380.1916932907347,
"durationMs" : {
"addBatch" : 237,
"getBatch" : 26,
"getOffset" : 0,
"queryPlanning" : 1,
"triggerExecution" : 313,
"walCommit" : 45
},
"stateOperators" : [ ],
"sources" : [ {
"description" : "RateSource[rowsPerSecond=1, rampUpTimeSeconds=0, numPartitions=8]"
,
"startOffset" : 0,
"endOffset" : 432,
"numInputRows" : 432,
"inputRowsPerSecond" : 0.9993568953312452,
"processedRowsPerSecond" : 1380.1916932907347
} ],
"sink" : {
"description" : "ConsoleSink[numRows=20, truncate=true]"
}
}

StreamExecution uses the OffsetSeqLog and BatchCommitLog metadata logs as the write-ahead log (to record offsets to be processed) and the log of offsets that have already been processed and committed to a streaming sink, respectively.

Tip Monitor offsets and commits metadata logs to know the progress of a streaming query.

StreamExecution delays polling for new data for 10 milliseconds (when no data was

available to process in a batch). Use spark.sql.streaming.pollingDelay Spark property to


control the delay.
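If the default delay does not suit a test setup, the property can be set before the query starts (a sketch; treat the property name as given above and note it is an internal knob):

// Assumes an active SparkSession named spark; value is a duration
spark.conf.set("spark.sql.streaming.pollingDelay", "100ms")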


Every StreamExecution is uniquely identified by an ID of the streaming query (which is the


id of the StreamMetadata).

Note Since the StreamMetadata is persisted (to the metadata file in the checkpoint directory), the streaming query ID "survives" query restarts as long as the checkpoint directory is preserved.

StreamExecution is also uniquely identified by a run ID of the streaming query. A run ID is

a randomly-generated 128-bit universally unique identifier (UUID) that is assigned at the


time StreamExecution is created.

Note runId does not "survive" query restarts and will always be different yet unique (across all active queries).

Note The name, id and runId are all unique across all active queries (in a StreamingQueryManager). The difference is that:

  name is optional and user-defined

  id is a UUID that is auto-generated at the time StreamExecution is created and persisted to the metadata checkpoint file

  runId is a UUID that is auto-generated every time StreamExecution is created

StreamExecution uses a StreamMetadata that is persisted in the metadata file in the

checkpoint directory. If the metadata file is available it is read and is the way to recover the
ID of a streaming query when resumed (i.e. restarted after a failure or a planned stop).
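These identifiers are directly accessible on the StreamingQuery handle (a quick illustration, assuming a started query sq):

// Assumes sq is a started org.apache.spark.sql.streaming.StreamingQuery
println(sq.name)  // optional, user-defined (null when not set)
println(sq.id)    // persisted in the checkpoint metadata, so it survives restarts
println(sq.runId) // regenerated on every (re)start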

StreamExecution uses __is_continuous_processing local property (default: false ) to

differentiate between ContinuousExecution ( true ) and MicroBatchExecution ( false )


which is used when StateStoreRDD is requested to compute a partition (and finds a
StateStore for a given version).

Tip Enable ALL logging level for org.apache.spark.sql.execution.streaming.StreamExecution to see what happens inside.

Add the following line to conf/log4j.properties :

log4j.logger.org.apache.spark.sql.execution.streaming.StreamExecution=ALL

Refer to Logging.

Creating StreamExecution Instance


StreamExecution takes the following to be created:

SparkSession

Name of the streaming query (can also be null )

Path of the checkpoint directory (aka metadata directory)

Streaming query (as an analyzed logical query plan, i.e. LogicalPlan )

Streaming sink

Trigger

Clock

Output mode

deleteCheckpointOnStop flag (to control whether to delete the checkpoint directory on stop)

StreamExecution initializes the internal properties.

Note StreamExecution is a Scala abstract class and cannot be created directly. It is created indirectly when the concrete StreamExecutions are.

Write-Ahead Log (WAL) of Offsets —  offsetLog Property

offsetLog: OffsetSeqLog

offsetLog is a Hadoop DFS-based metadata storage (of OffsetSeqs) with offsets

metadata directory.

offsetLog is used as Write-Ahead Log of Offsets to persist offsets of the data about to be

processed in every trigger.

Note Metadata log or metadata checkpoint are synonyms and are often used interchangeably.

The number of entries in the OffsetSeqLog is controlled using


spark.sql.streaming.minBatchesToRetain configuration property (default: 100 ). Stream
execution engines discard (purge) offsets from the offsets metadata log when the current
batch ID (in MicroBatchExecution) or the epoch committed (in ContinuousExecution) is
above the threshold.


Note offsetLog is used when:

  ContinuousExecution stream execution engine is requested to commit an epoch, getStartOffsets, and addOffset

  MicroBatchExecution stream execution engine is requested to populate start offsets and construct (or skip) the next streaming micro-batch
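A quick way to see the write-ahead log in action is to list the offsets directory under the checkpoint location of a running query (a sketch using plain Java NIO; the path below is hypothetical and should match your checkpointLocation):

import java.nio.file.{Files, Paths}
import scala.collection.JavaConverters._

// Hypothetical checkpoint location given via the checkpointLocation option at start
val checkpointLocation = "/tmp/checkpoint-rate2console"

// One file per batch/epoch written to the WAL, named after its id (0, 1, 2, ...)
Files.list(Paths.get(s"$checkpointLocation/offsets"))
  .iterator.asScala
  .foreach(println)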

State of Streaming Query (Execution) —  state Property

state: AtomicReference[State]

state indicates the internal state of execution of the streaming query (as

java.util.concurrent.atomic.AtomicReference).

Table 3. States

ACTIVE
  StreamExecution has been requested to run stream processing (and is about to run the activated streaming query)

INITIALIZING
  StreamExecution has been created

TERMINATED
  Used to indicate that:
  MicroBatchExecution has been requested to stop
  ContinuousExecution has been requested to stop
  StreamExecution has been requested to run stream processing (and has finished running the activated streaming query)

RECONFIGURING
  Used only when ContinuousExecution is requested to run a streaming query in continuous mode (and the ContinuousReader indicated a need for reconfiguration)

Available Offsets (StreamProgress) —  availableOffsets Property

availableOffsets: StreamProgress


availableOffsets is a collection of offsets per streaming source to track what data (by

offset) is available for processing for every streaming source in the streaming query (and
have not yet been committed).

availableOffsets works in tandem with the committedOffsets internal registry.

availableOffsets is empty when StreamExecution is created (i.e. no offsets are reported

for any streaming source in the streaming query).

Note availableOffsets is used when:

  MicroBatchExecution stream execution engine is requested to resume and fetch the start offsets from checkpoint, check whether new data is available, construct the next streaming micro-batch and run a single streaming micro-batch

  ContinuousExecution stream execution engine is requested to commit an epoch

  StreamExecution is requested for the internal string representation

Committed Offsets (StreamProgress) —  committedOffsets Property

committedOffsets: StreamProgress

committedOffsets is a collection of offsets per streaming source to track what data (by

offset) has already been processed and committed (to the sink or state stores) for every
streaming source in the streaming query.

committedOffsets works in tandem with the availableOffsets internal registry.

Note committedOffsets is used when:

  MicroBatchExecution stream execution engine is requested for the start offsets (from checkpoint), to check whether new data is available and run a single streaming micro-batch

  ContinuousExecution stream execution engine is requested for the start offsets (from checkpoint) and to commit an epoch

  StreamExecution is requested for the internal string representation

Fully-Qualified (Resolved) Path to Checkpoint Root Directory —  resolvedCheckpointRoot Property


resolvedCheckpointRoot: String

resolvedCheckpointRoot is a fully-qualified path of the given checkpoint root directory.

The given checkpoint root directory is defined using checkpointLocation option or the
spark.sql.streaming.checkpointLocation configuration property with queryName option.

checkpointLocation and queryName options are defined when StreamingQueryManager is

requested to create a streaming query.

resolvedCheckpointRoot is used when creating the path to the checkpoint directory and

when StreamExecution finishes running streaming batches.

resolvedCheckpointRoot is used for the logicalPlan (while transforming analyzedPlan and planning StreamingRelation logical operators to the corresponding StreamingExecutionRelation logical operators with the streaming data sources created, passing in the path to the sources directory to store checkpointing metadata).

Tip You can see resolvedCheckpointRoot in the INFO message when StreamExecution is started:

Starting [prettyIdString]. Use [resolvedCheckpointRoot] to store the query checkpoint.

Internally, resolvedCheckpointRoot creates a Hadoop org.apache.hadoop.fs.Path for


checkpointRoot and makes it qualified.

Note resolvedCheckpointRoot uses SparkSession to access SessionState for a Hadoop configuration.

Offset Commit Log —  commits Metadata Checkpoint Directory
StreamExecution uses offset commit log (CommitLog with commits metadata checkpoint

directory) for streaming batches successfully executed (with a single file per batch with a file
name being the batch id) or committed epochs.

Note Metadata log or metadata checkpoint are synonyms and are often used interchangeably.

commitLog is used by the stream execution engines for the following:


MicroBatchExecution is requested to run an activated streaming query (that in turn

requests to populate the start offsets at the very beginning of the streaming query
execution and later regularly every single batch)

ContinuousExecution is requested to run an activated streaming query in continuous

mode (that in turn requests to retrieve the start offsets at the very beginning of the
streaming query execution and later regularly every commit)

Last Query Execution Of Streaming Query (IncrementalExecution) —  lastExecution Property

lastExecution: IncrementalExecution

Note lastExecution is part of the ProgressReporter Contract for the QueryExecution of a streaming query.

lastExecution is an IncrementalExecution (a QueryExecution of a streaming query) of the most recent (last) execution.

lastExecution is created when the stream execution engines are requested for the

following:

MicroBatchExecution is requested to run a single streaming micro-batch (when in

queryPlanning Phase)

ContinuousExecution stream execution engine is requested to run a streaming query

(when in queryPlanning Phase)

lastExecution is used when:

StreamExecution is requested to explain a streaming query (via explainInternal)

ProgressReporter is requested to extractStateOperatorMetrics, extractExecutionStats,

and extractSourceToNumInputRows

MicroBatchExecution stream execution engine is requested to construct or skip the next

streaming micro-batch (based on StateStoreWriters in a streaming query), run a single


streaming micro-batch (when in addBatch Phase and updating watermark and
committing offsets to offset commit log)

ContinuousExecution stream execution engine is requested to run a streaming query

(when in runContinuous Phase)

For debugging query execution of streaming queries (using debugCodegen )


Explaining Streaming Query —  explain Method

explain(): Unit (1)


explain(extended: Boolean): Unit

1. Turns the extended flag off ( false )

explain simply prints out explainInternal to the standard output.

Note explain is used when…​FIXME
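A streaming query handle exposes the same method, so explaining a running query is as simple as (assuming a started query sq):

// Assumes sq is a started org.apache.spark.sql.streaming.StreamingQuery
sq.explain()                // physical plan of the last execution
sq.explain(extended = true) // logical and physical plans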

explainInternal Method

explainInternal(extended: Boolean): String

explainInternal …​FIXME

Note explainInternal is used when:

  StreamExecution is requested to explain a streaming query

  StreamingQueryWrapper is requested to explainInternal

Stopping Streaming Sources and Readers —  stopSources Method

stopSources(): Unit

stopSources requests every streaming source (in the streaming query) to stop.

In case of a non-fatal exception, stopSources prints out the following WARN message to
the logs:

Failed to stop streaming source: [source]. Resources may have leaked.

Note stopSources is used when:

  StreamExecution is requested to run stream processing (and terminates successfully or not)

  ContinuousExecution is requested to run the streaming query in continuous mode (and terminates)


Running Stream Processing —  runStream Internal Method

runStream(): Unit

runStream simply prepares the environment to execute the activated streaming query.

Note runStream is used exclusively when the stream execution thread is requested to start (when DataStreamWriter is requested to start an execution of the streaming query).

Internally, runStream sets the job group (to all the Spark jobs started by this thread) as
follows:

runId for the job group ID

getBatchDescriptionString for the job group description (to display in web UI)

interruptOnCancel flag on

Note runStream uses the SparkSession to access SparkContext and assign the job group id. Read up on SparkContext.setJobGroup method in The Internals of Apache Spark book.

runStream sets sql.streaming.queryId local property to id.

runStream requests the MetricsSystem to register the MetricsReporter when

spark.sql.streaming.metricsEnabled configuration property is on (default: off / false ).

runStream notifies StreamingQueryListeners that the streaming query has been started (by

posting a new QueryStartedEvent event with id, runId, and name).

Figure 5. StreamingQueryListener Notified about Query’s Start (onQueryStarted)


runStream unblocks the main starting thread (by decrementing the count of the startLatch

that when 0 lets the starting thread continue).

Caution FIXME A picture with two parallel lanes for the starting thread and a daemon one for the query.

runStream updates the status message to be Initializing sources.

runStream initializes the analyzed logical plan.

Note The analyzed logical plan is a lazy value in Scala and is initialized when requested the very first time.

runStream disables adaptive query execution and cost-based join optimization (by

turning spark.sql.adaptive.enabled and spark.sql.cbo.enabled configuration properties off,


respectively).

runStream creates a new "zero" OffsetSeqMetadata.

(Only when in INITIALIZING state) runStream enters ACTIVE state:

Decrements the count of initializationLatch

Executes the activated streaming query (which is different per StreamExecution, i.e.
ContinuousExecution or MicroBatchExecution).

Note runBatches does the main work only when first started (i.e. when state is INITIALIZING ).

runStream …​FIXME (describe the failed and stop states)

Once TriggerExecutor has finished executing batches, runBatches updates the status
message to Stopped.

Note TriggerExecutor finishes executing batches when the batch runner returns whether the streaming query is stopped or not (which is when the internal state is not TERMINATED ).

Caution FIXME Describe catch block for exception handling

Running Stream Processing —  finally Block


runStream releases the startLatch and initializationLatch locks.

runStream stopSources.

runStream sets the state to TERMINATED.


runStream sets the StreamingQueryStatus with the isTriggerActive and isDataAvailable

flags off ( false ).

runStream removes the stream metrics reporter from the application’s MetricsSystem .

runStream requests the StreamingQueryManager to handle termination of a streaming

query.

runStream creates a new QueryTerminatedEvent (with the id and run id of the streaming

query) and posts it.

With the deleteCheckpointOnStop flag enabled and no StreamingQueryException reported,


runStream deletes the checkpoint directory recursively.

In the end, runStream releases the terminationLatch lock.

TriggerExecutor’s Batch Runner


Batch Runner (aka batchRunner ) is an executable block executed by TriggerExecutor in
runBatches.

batchRunner starts trigger calculation.

As long as the query is not stopped (i.e. state is not TERMINATED ), batchRunner executes
the streaming batch for the trigger.

In triggerExecution time-tracking section, runBatches branches off per currentBatchId.

Table 4. Current Batch Execution per currentBatchId

currentBatchId < 0
  1. populateStartOffsets
  2. Setting Job Description as getBatchDescriptionString
  DEBUG Stream running from [committedOffsets] to [availableOffsets]

currentBatchId >= 0
  1. Constructing the next streaming micro-batch

If there is data available in the sources, batchRunner marks currentStatus with


isDataAvailable enabled.


Note You can check out the status of a streaming query using the status method.

scala> spark.streams.active(0).status
res1: org.apache.spark.sql.streaming.StreamingQueryStatus =
{
  "message" : "Waiting for next trigger",
  "isDataAvailable" : false,
  "isTriggerActive" : false
}

batchRunner then updates the status message to Processing new data and runs the

current streaming batch.

Figure 6. StreamExecution’s Running Batches (on Execution Thread)


After triggerExecution section has finished, batchRunner finishes the streaming batch for
the trigger (and collects query execution statistics).

When there was data available in the sources, batchRunner updates committed offsets (by
adding the current batch id to BatchCommitLog and adding availableOffsets to
committedOffsets).

You should see the following DEBUG message in the logs:

DEBUG batch $currentBatchId committed

batchRunner increments the current batch id and sets the job description for all the following

Spark jobs to include the new batch id.

When no data was available in the sources to process, batchRunner does the following:


1. Marks currentStatus with isDataAvailable disabled

2. Updates the status message to Waiting for data to arrive

3. Sleeps the current thread for pollingDelayMs milliseconds.

batchRunner updates the status message to Waiting for next trigger and returns whether

the query is currently active or not (so TriggerExecutor can decide whether to finish
executing the batches or not)

Starting Streaming Query (on Stream Execution Thread) —  start Method

start(): Unit

When called, start prints out the following INFO message to the logs:

Starting [prettyIdString]. Use [resolvedCheckpointRoot] to store the query checkpoint.

start then starts the stream execution thread (as a daemon thread).

Note start uses Java’s java.lang.Thread.start to run the streaming query on a separate execution thread.

Note When started, a streaming query runs in its own execution thread on JVM.

In the end, start pauses the main thread (using the startLatch until StreamExecution is
requested to run the streaming query that in turn sends a QueryStartedEvent to all
streaming listeners followed by decrementing the count of the startLatch).

Note start is used exclusively when StreamingQueryManager is requested to start a streaming query (when DataStreamWriter is requested to start an execution of the streaming query).

Path to Checkpoint Directory —  checkpointFile Internal Method

checkpointFile(name: String): String

checkpointFile gives the path of a directory with name in checkpoint directory.

Note checkpointFile uses Hadoop’s org.apache.hadoop.fs.Path .


Note checkpointFile is used for streamMetadata, OffsetSeqLog, BatchCommitLog, and lastExecution (for runBatch).

Posting StreamingQueryListener Event —  postEvent Method

postEvent(event: StreamingQueryListener.Event): Unit

Note postEvent is a part of ProgressReporter Contract.

postEvent simply requests the StreamingQueryManager to post the input event (to the

StreamingQueryListenerBus in the current SparkSession ).

Note postEvent uses SparkSession to access the current StreamingQueryManager .

Note postEvent is used when:

  ProgressReporter reports update progress (while finishing a trigger)

  StreamExecution runs streaming batches (and announces starting a streaming query by posting a QueryStartedEvent and query termination by posting a QueryTerminatedEvent)

Waiting Until No New Data Available in Sources or Query Has Been Terminated —  processAllAvailable Method

processAllAvailable(): Unit

Note processAllAvailable is a part of StreamingQuery Contract.

processAllAvailable reports the StreamingQueryException if reported (and returns

immediately).

Note streamDeathCause is reported exclusively when StreamExecution is requested to run stream execution (that terminated with an exception).

processAllAvailable returns immediately when StreamExecution is no longer active (in

TERMINATED state).

processAllAvailable acquires a lock on the awaitProgressLock and turns the noNewData

internal flag off ( false ).


processAllAvailable keeps polling with 10-second pauses (locked on

awaitProgressLockCondition) until noNewData flag is turned on ( true ) or StreamExecution


is no longer active (in TERMINATED state).

Note The 10-second pause is hardcoded and cannot be changed.

In the end, processAllAvailable releases awaitProgressLock lock.

processAllAvailable throws an IllegalStateException when executed on the stream

execution thread:

Cannot wait for a query state from the same thread that is running the query
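processAllAvailable is handy in tests to block until everything currently available has been processed (a minimal sketch with the memory sink; the query name is arbitrary and an active SparkSession named spark is assumed):

val sq = spark
  .readStream
  .format("rate")
  .load
  .writeStream
  .format("memory")
  .queryName("rate2memory")
  .start

// Blocks until all data available at the time of the call has been processed
// (never call it from the stream execution thread itself)
sq.processAllAvailable()
spark.table("rate2memory").show(truncate = false)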

Stream Execution Thread —  queryExecutionThread Property

queryExecutionThread: QueryExecutionThread

queryExecutionThread is a Java thread of execution (java.util.Thread) that runs a streaming

query.

queryExecutionThread is started (as a daemon thread) when StreamExecution is requested

to start. At that time, start prints out the following INFO message to the logs (with the
prettyIdString and the resolvedCheckpointRoot):

Starting [prettyIdString]. Use [resolvedCheckpointRoot] to store the query checkpoint.

When started, queryExecutionThread sets the call site and runs the streaming query.

queryExecutionThread uses the name stream execution thread for [id] (that uses

prettyIdString for the id, i.e. queryName [id = [id], runId = [runId]] ).

queryExecutionThread is a QueryExecutionThread that is a custom UninterruptibleThread

from Apache Spark with runUninterruptibly method for running a block of code without
being interrupted by Thread.interrupt() .

Use Java’s jconsole or jstack to monitor stream execution threads.

Tip $ jstack <driver-pid> | grep -e "stream execution thread"


"stream execution thread for kafka-topic1 [id =...


Internal String Representation —  toDebugString Internal Method

toDebugString(includeLogicalPlan: Boolean): String

toDebugString …​FIXME

Note toDebugString is used exclusively when StreamExecution is requested to run stream processing (and an exception is caught).

Current Batch Metadata (Event-Time Watermark and Timestamp) —  offsetSeqMetadata Internal Property

offsetSeqMetadata: OffsetSeqMetadata

offsetSeqMetadata is a OffsetSeqMetadata.

Note offsetSeqMetadata is part of the ProgressReporter Contract to hold the current event-time watermark and timestamp.

offsetSeqMetadata is used to create an IncrementalExecution in the queryPlanning phase

of the MicroBatchExecution and ContinuousExecution execution engines.

offsetSeqMetadata is initialized (with 0 for batchWatermarkMs and batchTimestampMs )

when StreamExecution is requested to run stream processing.

offsetSeqMetadata is then updated (with the current event-time watermark and timestamp)

when MicroBatchExecution is requested to construct the next streaming micro-batch.

Note MicroBatchExecution uses the WatermarkTracker for the current event-time watermark and the trigger clock for the current batch timestamp.

offsetSeqMetadata is stored (checkpointed) in walCommit phase of MicroBatchExecution

(and printed out as INFO message to the logs).

FIXME INFO message

offsetSeqMetadata is restored (re-created) from a checkpointed state when

MicroBatchExecution is requested to populate start offsets.

isActive Method


isActive: Boolean

Note isActive is part of the StreamingQuery Contract to indicate whether a streaming query is active ( true ) or not ( false ).

isActive is enabled ( true ) as long as the State is not TERMINATED.

exception Method

exception: Option[StreamingQueryException]

Note exception is part of the StreamingQuery Contract to indicate whether a streaming query…​FIXME

exception …​FIXME

Human-Readable HTML Description of Spark Jobs (for web UI) —  getBatchDescriptionString Method

getBatchDescriptionString: String

getBatchDescriptionString is a human-readable description (in HTML format) that uses the

optional name if defined, the id, the runId and batchDescription that can be init (for the
current batch ID negative) or the current batch ID itself.

getBatchDescriptionString is of the following format:

[name]<br/>id = [id]<br/>runId = [runId]<br/>batch = [batchDescription]

Figure 7. Monitoring Streaming Query using web UI (Spark Jobs)


Note getBatchDescriptionString is used when:

  MicroBatchExecution stream execution engine is requested to run an activated streaming query (as the job description of any Spark jobs triggered as part of query execution)

  StreamExecution is requested to run stream processing (as the job group description of any Spark jobs triggered as part of query execution)

No New Data Available —  noNewData Internal Flag

noNewData: Boolean

noNewData is a flag that indicates that a batch has completed with no new data left and

processAllAvailable could stop waiting till all streaming data is processed.

Default: false

Turned on ( true ) when:

MicroBatchExecution stream execution engine is requested to construct or skip the next

streaming micro-batch (while skipping the next micro-batch)

ContinuousExecution stream execution engine is requested to addOffset

Turned off ( false ) when:

MicroBatchExecution stream execution engine is requested to construct or skip the next

streaming micro-batch (right after the walCommit phase)

StreamExecution is requested to processAllAvailable

Internal Properties

awaitProgressLock
  Java's fair reentrant mutual exclusion java.util.concurrent.locks.ReentrantLock (that favors granting access to the longest-waiting thread under contention)

awaitProgressLockCondition
  Lock

callSite

currentBatchId
  Current batch ID
  Starts at -1 when StreamExecution is created
  0 when StreamExecution populates start offsets (and OffsetSeqLog is empty, i.e. no offset files in offsets directory in checkpoint)
  Incremented when StreamExecution runs streaming batches and finishes a trigger that had data available from sources (right after committing the batch).

initializationLatch

newData
  newData: Map[BaseStreamingSource, LogicalPlan]

  Registry of the streaming sources (in the logical query plan) that have new data available in the current batch. The new data is a streaming DataFrame .
  Note newData is part of the ProgressReporter Contract.
  Set exclusively when StreamExecution is requested to request unprocessed data from streaming sources (while running a single streaming batch).
  Used exclusively when StreamExecution is requested to transform the logical plan (of the streaming query) to include the Sources and the MicroBatchReaders with new data (while running a single streaming batch).

pollingDelayMs
  Time delay before polling new data again when no data was available
  Set to spark.sql.streaming.pollingDelay Spark property.
  Used when StreamExecution has started running streaming batches (and no data was available to process in a trigger).

prettyIdString
  Pretty-identified string for identification in logs (with name if defined).
  queryName [id = xyz, runId = abc]
  [id = xyz, runId = abc]

startLatch
  Java's java.util.concurrent.CountDownLatch with count 1 .
  Used when StreamExecution is requested to start to pause the main thread until StreamExecution was requested to run the streaming query.

streamDeathCause
  StreamingQueryException

streamMetrics
  MetricsReporter with spark.streaming.[name or id] source name
  Uses name if defined (can be null ) or falls back to id

uniqueSources
  Unique streaming sources (after being collected as StreamingExecutionRelation from the logical query plan).
  Note StreamingExecutionRelation is a leaf logical operator (i.e. LogicalPlan ) that represents a streaming data source (and corresponds to a single StreamingRelation in the analyzed logical query plan of a streaming Dataset).
  Used when StreamExecution :
    Constructs the next streaming micro-batch (and gets new offsets for every streaming data source)
    Stops all streaming data sources


StreamingQueryWrapper — Serializable
StreamExecution
StreamingQueryWrapper is a serializable interface of a StreamExecution.

Demo: Any Streaming Query is StreamingQueryWrapper

import org.apache.spark.sql.execution.streaming.StreamingQueryWrapper
val query = spark
.readStream
.format("rate")
.load
.writeStream
.format("memory")
.queryName("rate2memory")
.start
assert(query.isInstanceOf[StreamingQueryWrapper])

StreamingQueryWrapper has the same StreamExecution API and simply passes all the

method calls along to the underlying StreamExecution.

StreamingQueryWrapper is created when StreamingQueryManager is requested to create a

streaming query (when DataStreamWriter is requested to start an execution of the


streaming query).


TriggerExecutor
TriggerExecutor is the interface for trigger executors that StreamExecution uses to

execute a batch runner.

Note Batch runner is executable code that is executed at regular intervals. It is also called a trigger handler.

package org.apache.spark.sql.execution.streaming

trait TriggerExecutor {
def execute(batchRunner: () => Boolean): Unit
}

Note StreamExecution reports an IllegalStateException when TriggerExecutor is different from the two built-in implementations: OneTimeExecutor or ProcessingTimeExecutor .

Table 1. TriggerExecutor’s Available Implementations

OneTimeExecutor
  Executes batchRunner exactly once.

ProcessingTimeExecutor
  Executes batchRunner at regular intervals (as defined using ProcessingTime and the DataStreamWriter.trigger method).

  ProcessingTimeExecutor(
    processingTime: ProcessingTime,
    clock: Clock = new SystemClock())

  Note Processing terminates when batchRunner returns false .
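A self-contained sketch of a ProcessingTimeExecutor-like loop (conceptual only; Spark's implementation additionally tracks batch start times against a Clock and reports when batches fall behind):

// Conceptual trigger executor: invoke batchRunner at a fixed interval until it returns false
final class FixedIntervalExecutor(intervalMs: Long) {
  def execute(batchRunner: () => Boolean): Unit = {
    var continue = true
    while (continue) {
      val start = System.currentTimeMillis()
      continue = batchRunner() // false means the streaming query was stopped
      val elapsed = System.currentTimeMillis() - start
      val remaining = intervalMs - elapsed
      if (continue && remaining > 0) Thread.sleep(remaining)
    }
  }
}

object FixedIntervalExecutorDemo extends App {
  var batches = 0
  new FixedIntervalExecutor(1000).execute { () =>
    batches += 1
    println(s"batch $batches")
    batches < 3 // stop after three "batches"
  }
}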

notifyBatchFallingBehind Method

Caution FIXME


IncrementalExecution — QueryExecution of
Streaming Queries
IncrementalExecution is the QueryExecution of streaming queries.

Tip Read up on QueryExecution in The Internals of Spark SQL book.

IncrementalExecution is created (and becomes the StreamExecution.lastExecution) when:

MicroBatchExecution is requested to run a single streaming micro-batch (in

queryPlanning phase)

ContinuousExecution is requested to run a streaming query in continuous mode (in

queryPlanning phase)

Dataset.explain operator is executed (on a streaming query)

IncrementalExecution uses the statefulOperatorId internal counter for the IDs of the

stateful operators in the optimized logical plan (while applying the preparations rules) when
requested to prepare the plan for execution (in executedPlan phase).

Preparing Logical Plan (of Streaming Query) for Execution — optimizedPlan and executedPlan Phases of Query Execution
When requested for the optimized logical plan (of the logical plan), IncrementalExecution
transforms CurrentBatchTimestamp and ExpressionWithRandomSeed expressions with the
timestamp literal and new random seeds, respectively. When transforming
CurrentBatchTimestamp expressions, IncrementalExecution prints out the following INFO

message to the logs:

Current batch timestamp = [timestamp]
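To see this message in the logs, enable logging for IncrementalExecution the same way as the other Tips in this book do; the logger name below assumes the usual convention of the fully-qualified class name:

log4j.logger.org.apache.spark.sql.execution.streaming.IncrementalExecution=ALL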

Once created, IncrementalExecution is immediately executed (by the MicroBatchExecution


and ContinuousExecution stream execution engines in the queryPlanning phase) and so
the entire query execution pipeline is executed up to and including executedPlan. That
means that the extra planning strategies and the state preparation rule have been applied at
this point and the streaming query is ready for execution.

Creating IncrementalExecution Instance


IncrementalExecution takes the following to be created:

SparkSession

Logical plan ( LogicalPlan )

OutputMode (as specified using DataStreamWriter.outputMode method)

State checkpoint location

Run ID of a streaming query ( UUID )

Batch ID

OffsetSeqMetadata

State Checkpoint Location (Directory)


When created, IncrementalExecution is given the checkpoint location.

For the two available execution engines (MicroBatchExecution and ContinuousExecution),


the checkpoint location is actually state directory under the checkpoint root directory.

val queryName = "rate2memory"


val checkpointLocation = s"file:/tmp/checkpoint-$queryName"
val query = spark
.readStream
.format("rate")
.load
.writeStream
.format("memory")
.queryName(queryName)
.option("checkpointLocation", checkpointLocation)
.start

// Give the streaming query a moment (one micro-batch)


// So lastExecution is available for the checkpointLocation
import scala.concurrent.duration._
query.awaitTermination(1.second.toMillis)

import org.apache.spark.sql.execution.streaming.StreamingQueryWrapper
val stateCheckpointDir = query
.asInstanceOf[StreamingQueryWrapper]
.streamingQuery
.lastExecution
.checkpointLocation
val stateDir = s"$checkpointLocation/state"
assert(stateCheckpointDir equals stateDir)


State checkpoint location is used exclusively when IncrementalExecution is requested for


the state info of the next stateful operator (when requested to optimize a streaming physical
plan using the state preparation rule that creates the stateful physical operators:
StateStoreSaveExec, StateStoreRestoreExec, StreamingDeduplicateExec,
FlatMapGroupsWithStateExec, StreamingSymmetricHashJoinExec, and
StreamingGlobalLimitExec).

Number of State Stores (spark.sql.shuffle.partitions) — numStateStores Internal Property

numStateStores: Int

numStateStores is the number of state stores which corresponds to

spark.sql.shuffle.partitions configuration property (default: 200 ).

Tip: Read up on spark.sql.shuffle.partitions configuration property (and the others) in The Internals of Spark SQL book.

Internally, numStateStores requests the OffsetSeqMetadata for the


spark.sql.shuffle.partitions configuration property (using the streaming configuration) or
simply takes whatever was defined for the given SparkSession (default: 200 ).

numStateStores is initialized right when IncrementalExecution is created.

numStateStores is used exclusively when IncrementalExecution is requested for the state

info of the next stateful operator (when requested to optimize a streaming physical plan
using the state preparation rule that creates the stateful physical operators:
StateStoreSaveExec, StateStoreRestoreExec, StreamingDeduplicateExec,
FlatMapGroupsWithStateExec, StreamingSymmetricHashJoinExec, and
StreamingGlobalLimitExec).
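A quick way to see the session default that numStateStores falls back to (a sketch; the value recorded in the checkpointed OffsetSeqMetadata takes precedence when available):

// 200 is the Spark SQL default unless overridden for the session
assert(spark.conf.get("spark.sql.shuffle.partitions") == "200")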

Extra Planning Strategies for Streaming Queries — planner Property
IncrementalExecution uses a custom SparkPlanner with the following extra planning

strategies to plan the streaming query for execution:

StreamingJoinStrategy

StatefulAggregationStrategy

FlatMapGroupsWithStateStrategy


StreamingRelationStrategy

StreamingDeduplicationStrategy

StreamingGlobalLimitStrategy

Tip Read up on SparkPlanner in The Internals of Spark SQL book.

State Preparation Rule For Execution-Specific Configuration — state Property

state: Rule[SparkPlan]

state is a custom physical preparation rule ( Rule[SparkPlan] ) that can transform a

streaming physical plan ( SparkPlan ) with the following physical operators:

StateStoreSaveExec with any unary physical operator ( UnaryExecNode ) with a StateStoreRestoreExec

StreamingDeduplicateExec

FlatMapGroupsWithStateExec

StreamingSymmetricHashJoinExec

StreamingGlobalLimitExec

state simply transforms the physical plan with the above physical operators and fills out

the execution-specific configuration:

nextStatefulOperationStateInfo for the state info

OutputMode

batchWatermarkMs (through the OffsetSeqMetadata) for the event-time watermark

batchTimestampMs (through the OffsetSeqMetadata) for the current timestamp

getStateWatermarkPredicates for the state watermark predicates (for


StreamingSymmetricHashJoinExec)

state rule is used (as part of the physical query optimizations) when IncrementalExecution

is requested to optimize (prepare) the physical plan of the streaming query (once for
ContinuousExecution and every trigger for MicroBatchExecution in their queryPlanning
phases).

Tip Read up on Physical Query Optimizations in The Internals of Spark SQL book.


nextStatefulOperationStateInfo Internal Method

nextStatefulOperationStateInfo(): StatefulOperatorStateInfo

nextStatefulOperationStateInfo simply creates a new StatefulOperatorStateInfo with the

state checkpoint location, the run ID (of the streaming query), the next statefulOperator ID,
the current batch ID, and the number of state stores.

Note: The only changing part of StatefulOperatorStateInfo across calls of the nextStatefulOperationStateInfo method is the next statefulOperator ID.

All the other properties (the state checkpoint location, the run ID, the current batch ID, and the number of state stores) are the same within a single IncrementalExecution instance.

The only two properties that may ever change are the run ID (after a streaming query is restarted from the checkpoint) and the current batch ID (every micro-batch in MicroBatchExecution execution engine).

Note: nextStatefulOperationStateInfo is used exclusively when IncrementalExecution is requested to optimize a streaming physical plan using the state preparation rule (and creates the stateful physical operators: StateStoreSaveExec, StateStoreRestoreExec, StreamingDeduplicateExec, FlatMapGroupsWithStateExec, StreamingSymmetricHashJoinExec, and StreamingGlobalLimitExec).

Checking Out Whether Last Execution Requires Another Non-Data Micro-Batch — shouldRunAnotherBatch Method

shouldRunAnotherBatch(newMetadata: OffsetSeqMetadata): Boolean

shouldRunAnotherBatch is positive ( true ) if there is at least one StateStoreWriter operator

(in the executedPlan physical query plan) that requires another non-data batch (per the
given OffsetSeqMetadata with the event-time watermark and the batch timestamp).

Otherwise, shouldRunAnotherBatch is negative ( false ).

Note: shouldRunAnotherBatch is used exclusively when MicroBatchExecution is requested to construct the next streaming micro-batch (and checks out whether the last batch execution requires another non-data batch).

Demo: State Checkpoint Directory


// START: Only for easier debugging


// The state is then only for one partition
// which should make monitoring easier
import org.apache.spark.sql.internal.SQLConf.SHUFFLE_PARTITIONS
spark.sessionState.conf.setConf(SHUFFLE_PARTITIONS, 1)

assert(spark.sessionState.conf.numShufflePartitions == 1)
// END: Only for easier debugging

val counts = spark


.readStream
.format("rate")
.load
.groupBy(window($"timestamp", "5 seconds") as "group")
.agg(count("value") as "value_count") // <-- creates an Aggregate logical operator
.orderBy("group") // <-- makes for easier checking

assert(counts.isStreaming, "This should be a streaming query")

// Search for "checkpoint = <unknown>" in the following output


// Looks for StateStoreSave and StateStoreRestore
scala> counts.explain
== Physical Plan ==
*(5) Sort [group#5 ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(group#5 ASC NULLS FIRST, 1)
+- *(4) HashAggregate(keys=[window#11], functions=[count(value#1L)])
+- StateStoreSave [window#11], state info [ checkpoint = <unknown>, runId = 558b
f725-accb-487d-97eb-f790fa4a6138, opId = 0, ver = 0, numPartitions = 1], Append, 0, 2
+- *(3) HashAggregate(keys=[window#11], functions=[merge_count(value#1L)])
+- StateStoreRestore [window#11], state info [ checkpoint = <unknown>, run
Id = 558bf725-accb-487d-97eb-f790fa4a6138, opId = 0, ver = 0, numPartitions = 1], 2
+- *(2) HashAggregate(keys=[window#11], functions=[merge_count(value#1L
)])
+- Exchange hashpartitioning(window#11, 1)
+- *(1) HashAggregate(keys=[window#11], functions=[partial_count(
value#1L)])
+- *(1) Project [named_struct(start, precisetimestampconversio
n(((((CASE WHEN (cast(CEIL((cast((precisetimestampconversion(timestamp#0, TimestampType
, LongType) - 0) as double) / 5000000.0)) as double) = (cast((precisetimestampconversi
on(timestamp#0, TimestampType, LongType) - 0) as double) / 5000000.0)) THEN (CEIL((cas
t((precisetimestampconversion(timestamp#0, TimestampType, LongType) - 0) as double) /
5000000.0)) + 1) ELSE CEIL((cast((precisetimestampconversion(timestamp#0, TimestampType
, LongType) - 0) as double) / 5000000.0)) END + 0) - 1) * 5000000) + 0), LongType, Tim
estampType), end, precisetimestampconversion(((((CASE WHEN (cast(CEIL((cast((preciseti
mestampconversion(timestamp#0, TimestampType, LongType) - 0) as double) / 5000000.0))
as double) = (cast((precisetimestampconversion(timestamp#0, TimestampType, LongType) -
0) as double) / 5000000.0)) THEN (CEIL((cast((precisetimestampconversion(timestamp#0,
TimestampType, LongType) - 0) as double) / 5000000.0)) + 1) ELSE CEIL((cast((preciseti
mestampconversion(timestamp#0, TimestampType, LongType) - 0) as double) / 5000000.0))
END + 0) - 1) * 5000000) + 5000000), LongType, TimestampType)) AS window#11, value#1L]
+- *(1) Filter isnotnull(timestamp#0)
+- StreamingRelation rate, [timestamp#0, value#1L]


// Start the query to access lastExecution that has the checkpoint resolved
import scala.concurrent.duration._
import org.apache.spark.sql.streaming.{OutputMode, Trigger}
val t = Trigger.ProcessingTime(1.hour) // should be enough time for exploration
val sq = counts
.writeStream
.format("console")
.option("truncate", false)
.option("checkpointLocation", "/tmp/spark-streams-state-checkpoint-root")
.trigger(t)
.outputMode(OutputMode.Complete)
.start

// wait till the first batch which should happen right after start

import org.apache.spark.sql.execution.streaming._
val lastExecution = sq.asInstanceOf[StreamingQueryWrapper].streamingQuery.lastExecution
scala> println(lastExecution.checkpointLocation)
file:/tmp/spark-streams-state-checkpoint-root/state


StreamingQueryListenerBus — Event Bus for Streaming Events
StreamingQueryListenerBus is an event bus ( ListenerBus[StreamingQueryListener,

StreamingQueryListener.Event] ) for dispatching streaming life-cycle events of active

streaming queries (that eventually are delivered to StreamingQueryListeners).

StreamingQueryListenerBus is created for StreamingQueryManager (once per

SparkSession ).

Figure 1. StreamingQueryListenerBus is Created Once In SparkSession


StreamingQueryListenerBus is also a SparkListener and registers itself with the

LiveListenerBus (of the SparkSession ) to intercept QueryStartedEvents.

Creating StreamingQueryListenerBus Instance


StreamingQueryListenerBus takes the following when created:

LiveListenerBus

StreamingQueryListenerBus registers itself with the LiveListenerBus.

Run IDs of Active Streaming Queries

activeQueryRunIds: HashSet[UUID]


activeQueryRunIds is an internal registry of run IDs of active streaming queries in the

SparkSession .

A runId is added when StreamingQueryListenerBus is requested to post a


QueryStartedEvent

A runId is removed when StreamingQueryListenerBus is requested to post a


QueryTerminatedEvent

activeQueryRunIds is used internally to dispatch a streaming event to a

StreamingQueryListener (so the events gets sent out to streaming queries in the
SparkSession ).

Posting Streaming Event to LiveListenerBus — post Method

post(event: StreamingQueryListener.Event): Unit

post simply posts the input event directly to the LiveListenerBus unless it is a

QueryStartedEvent.

For a QueryStartedEvent, post adds the runId (of the streaming query that has been
started) to the activeQueryRunIds internal registry first, posts the event to the
LiveListenerBus and then postToAll.

Note: post is used exclusively when StreamingQueryManager is requested to post a streaming event.

doPostEvent Method

doPostEvent(
listener: StreamingQueryListener,
event: StreamingQueryListener.Event): Unit

Note: doPostEvent is part of Spark Core’s ListenerBus contract to post an event to the specified listener.

doPostEvent branches per the type of StreamingQueryListener.Event:

For a QueryStartedEvent, requests the StreamingQueryListener to onQueryStarted

For a QueryProgressEvent, requests the StreamingQueryListener to onQueryProgress


For a QueryTerminatedEvent, requests the StreamingQueryListener to onQueryTerminated

For any other event, doPostEvent simply does nothing (swallows it).

postToAll Method

postToAll(event: Event): Unit

Note: postToAll is part of Spark Core’s ListenerBus contract to post an event to all registered listeners.

postToAll first requests the parent ListenerBus to post the event to all registered listeners.

For a QueryTerminatedEvent, postToAll simply removes the runId (of the streaming
query that has been terminated) from the activeQueryRunIds internal registry.
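The events dispatched by doPostEvent end up in user-registered listeners. A minimal sketch of such a listener, registered through the public StreamingQueryManager API:

import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

val listener = new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit =
    println(s"Query started: ${event.id}")
  override def onQueryProgress(event: QueryProgressEvent): Unit =
    println(s"Batch ${event.progress.batchId} completed")
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit =
    println(s"Query terminated: ${event.id}")
}
// StreamingQueryManager posts streaming events to the StreamingQueryListenerBus
spark.streams.addListener(listener)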


StreamMetadata
StreamMetadata is a metadata associated with a StreamingQuery (indirectly through

StreamExecution).

StreamMetadata takes an ID to be created.

StreamMetadata is created exclusively when StreamExecution is created (with a randomly-

generated 128-bit universally unique identifier (UUID)).

StreamMetadata can be persisted to and unpersisted from a JSON file. StreamMetadata

uses json4s-jackson library for JSON persistence.

import org.apache.spark.sql.execution.streaming.StreamMetadata
import org.apache.hadoop.fs.Path
val metadataPath = new Path("metadata")

scala> :type spark


org.apache.spark.sql.SparkSession

val hadoopConf = spark.sessionState.newHadoopConf()


val sm = StreamMetadata.read(metadataPath, hadoopConf)

scala> :type sm
Option[org.apache.spark.sql.execution.streaming.StreamMetadata]

Unpersisting StreamMetadata (from JSON File) — read Object Method

read(
metadataFile: Path,
hadoopConf: Configuration): Option[StreamMetadata]

read unpersists StreamMetadata from the given metadataFile file if available.

read returns a StreamMetadata if the metadata file was available and the content could be

read in JSON format. Otherwise, read returns None .

Note read uses org.json4s.jackson.Serialization.read for JSON deserialization.

Note: read is used exclusively when StreamExecution is created (and tries to read the metadata checkpoint file).


Persisting Metadata —  write Object Method

write(
metadata: StreamMetadata,
metadataFile: Path,
hadoopConf: Configuration): Unit

write persists the given StreamMetadata to the given metadataFile file in JSON format.

Note write uses org.json4s.jackson.Serialization.write for JSON serialization.

Note: write is used exclusively when StreamExecution is created (and the metadata checkpoint file is not available).
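A round-trip sketch (the file path is hypothetical and assumed not to exist yet; StreamMetadata takes just the query ID):

import java.util.UUID
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.execution.streaming.StreamMetadata

val hadoopConf = spark.sessionState.newHadoopConf()
val metadataFile = new Path("/tmp/demo-stream-metadata")

// persist, then unpersist the metadata
StreamMetadata.write(StreamMetadata(UUID.randomUUID().toString), metadataFile, hadoopConf)
assert(StreamMetadata.read(metadataFile, hadoopConf).isDefined)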


EventTimeWatermark Unary Logical Operator — Streaming Watermark
EventTimeWatermark is a unary logical operator that is created to represent

Dataset.withWatermark operator in a logical query plan of a streaming query.

A unary logical operator ( UnaryNode ) is a logical operator with a single child


logical operator.
Note
Read up on UnaryNode (and logical operators in general) in The Internals of
Spark SQL book.

When requested for the output attributes, EventTimeWatermark logical operator goes over the
output attributes of the child logical operator to find the matching attribute based on the
eventTime attribute and updates it to include spark.watermarkDelayMs metadata key with the
watermark delay interval (converted to milliseconds).

EventTimeWatermark is resolved (planned) to EventTimeWatermarkExec physical operator in

StatefulAggregationStrategy execution planning strategy.

EliminateEventTimeWatermark logical optimization rule (i.e. Rule[LogicalPlan] ) removes


EventTimeWatermark logical operator from a logical plan if the child logical operator is not
streaming, i.e. when Dataset.withWatermark operator is used on a batch query.

val logs = spark.
  read. // <-- batch non-streaming query that makes `EliminateEventTimeWatermark` rule applicable
  format("text").
  load("logs")
// logs is a batch Dataset
assert(!logs.isStreaming)

val q = logs.
  withWatermark(eventTime = "timestamp", delayThreshold = "30 seconds") // <-- creates EventTimeWatermark
scala> println(q.queryExecution.logical.numberedTreeString) // <-- no EventTimeWatermark as it was removed immediately
00 Relation[value#0] text

Creating EventTimeWatermark Instance


EventTimeWatermark takes the following to be created:

Watermark column ( Attribute )

Watermark delay ( CalendarInterval )


Child logical operator ( LogicalPlan )

Output Schema —  output Property

output: Seq[Attribute]

Note: output is part of the QueryPlan Contract to describe the attributes of (the schema of) the output.

output finds eventTime column in the output schema of the child logical operator and

updates the Metadata of the column with spark.watermarkDelayMs key and the
milliseconds for the delay.

output removes spark.watermarkDelayMs key from the other columns.

// FIXME How to access/show the eventTime column with the metadata updated to include spark.watermarkDelayMs?
import org.apache.spark.sql.catalyst.plans.logical.EventTimeWatermark
val etw = q.queryExecution.logical.asInstanceOf[EventTimeWatermark]
scala> etw.output.toStructType.printTreeString
root
|-- timestamp: timestamp (nullable = true)
|-- value: long (nullable = true)
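One way to see the metadata directly (a sketch, assuming the rate source; the spark.watermarkDelayMs key is described in the next section):

val rates = spark.readStream.format("rate").load
val marked = rates.withWatermark("timestamp", "30 seconds")

import org.apache.spark.sql.catalyst.plans.logical.EventTimeWatermark
val etw = marked.queryExecution.analyzed.collectFirst { case e: EventTimeWatermark => e }.get
// the timestamp attribute now carries spark.watermarkDelayMs in its metadata
etw.output.foreach(a => println(s"${a.name}: ${a.metadata.json}"))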

Watermark Metadata (Marker) — spark.watermarkDelayMs Metadata Key
spark.watermarkDelayMs metadata key is used to mark one of the output attributes as the

watermark attribute (eventTime watermark).

Converting Human-Friendly CalendarInterval to Milliseconds — getDelayMs Object Method

getDelayMs(
delay: CalendarInterval): Long

getDelayMs …​FIXME

Note getDelayMs is used when…​FIXME


FlatMapGroupsWithState Unary Logical Operator
FlatMapGroupsWithState is a unary logical operator that is created to represent the following

operators in a logical query plan of a streaming query:

KeyValueGroupedDataset.mapGroupsWithState

KeyValueGroupedDataset.flatMapGroupsWithState

A unary logical operator ( UnaryNode ) is a logical operator with a single child


logical operator.
Note
Read up on UnaryNode (and logical operators in general) in The Internals of
Spark SQL book.

FlatMapGroupsWithState is resolved (planned) to:

FlatMapGroupsWithStateExec unary physical operator for streaming datasets (in


FlatMapGroupsWithStateStrategy execution planning strategy)

MapGroupsExec physical operator for batch datasets (in BasicOperators execution

planning strategy)

Creating SerializeFromObject with FlatMapGroupsWithState — apply Factory Method

apply[K: Encoder, V: Encoder, S: Encoder, U: Encoder](


func: (Any, Iterator[Any], LogicalGroupState[Any]) => Iterator[Any],
groupingAttributes: Seq[Attribute],
dataAttributes: Seq[Attribute],
outputMode: OutputMode,
isMapGroupsWithState: Boolean,
timeout: GroupStateTimeout,
child: LogicalPlan): LogicalPlan

apply creates a SerializeFromObject logical operator with a FlatMapGroupsWithState as its

child logical operator.

Internally, apply creates SerializeFromObject object consumer (aka unary logical


operator) with FlatMapGroupsWithState logical plan.


Internally, apply finds ExpressionEncoder for the type S and creates a


FlatMapGroupsWithState with UnresolvedDeserializer for the types K and V .

In the end, apply creates a SerializeFromObject object consumer with the


FlatMapGroupsWithState .

Note apply is used in KeyValueGroupedDataset.flatMapGroupsWithState operator.
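A minimal sketch of an operator chain that ends up as FlatMapGroupsWithState in the logical plan (pairs is a hypothetical streaming Dataset[(String, Long)] of key-value records):

import spark.implicits._
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

case class RunningCount(count: Long)

val counts = pairs // hypothetical streaming Dataset[(String, Long)]
  .groupByKey { case (key, _) => key }
  .flatMapGroupsWithState(OutputMode.Update, GroupStateTimeout.NoTimeout) {
    (key: String, values: Iterator[(String, Long)], state: GroupState[RunningCount]) =>
      val count = state.getOption.map(_.count).getOrElse(0L) + values.size
      state.update(RunningCount(count))
      Iterator((key, count))
  }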

Creating FlatMapGroupsWithState Instance


FlatMapGroupsWithState takes the following to be created:

State function ( (Any, Iterator[Any], LogicalGroupState[Any]) ⇒ Iterator[Any] )

Key deserializer Catalyst expression

Value deserializer Catalyst expression

Grouping attributes

Data attributes

Output object attribute

State ExpressionEncoder

Output mode

isMapGroupsWithState flag (default: false )

GroupStateTimeout

Child logical operator ( LogicalPlan )


Deduplicate Unary Logical Operator


Deduplicate is a unary logical operator (i.e. LogicalPlan ) that is created to represent

dropDuplicates operator (that drops duplicate records for a given subset of columns).

Deduplicate has streaming flag enabled for streaming Datasets.

val uniqueRates = spark.


readStream.
format("rate").
load.
dropDuplicates("value") // <-- creates Deduplicate logical operator
// Note the streaming flag
scala> println(uniqueRates.queryExecution.logical.numberedTreeString)
00 Deduplicate [value#33L], true // <-- streaming flag enabled
01 +- StreamingRelation DataSource(org.apache.spark.sql.SparkSession@4785f176,rate,List
(),None,List(),None,Map(),None), rate, [timestamp#32, value#33L]

Caution: FIXME Example with duplicates across batches to show that Deduplicate keeps state and withWatermark operator should also be used to limit how much is stored (to not cause OOM)
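In the meantime, a minimal sketch of the pattern: the watermark bounds how long the per-key deduplication state is kept (the columns below are those of the rate source).

// Drop duplicates with bounded state: entries older than the watermark can be removed
val deduped = spark
  .readStream
  .format("rate")
  .load
  .withWatermark("timestamp", "10 minutes")
  .dropDuplicates("value", "timestamp")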

UnsupportedOperationChecker ensures that dropDuplicates operator is not used after aggregation.

The following code is not supported in Structured Streaming and results in an AnalysisException:

val counts = spark.
  readStream.
  format("rate").
  load.
  groupBy(window($"timestamp", "5 seconds") as "group").
  agg(count("value") as "value_count").
  dropDuplicates // <-- after groupBy

import scala.concurrent.duration._
import org.apache.spark.sql.streaming.{OutputMode, Trigger}
val sq = counts.
  writeStream.
  format("console").
  trigger(Trigger.ProcessingTime(10.seconds)).
  outputMode(OutputMode.Complete).
  start
org.apache.spark.sql.AnalysisException: dropDuplicates is not supported after aggregation o


Note: Deduplicate logical operator is translated (aka planned) to:

StreamingDeduplicateExec physical operator in StreamingDeduplicationStrategy execution planning strategy for streaming Datasets (aka streaming plans)

Aggregate physical operator in ReplaceDeduplicateWithAggregate execution planning strategy for non-streaming/batch Datasets (aka batch plans)

The output schema of Deduplicate is exactly the child's output schema.

Creating Deduplicate Instance


Deduplicate takes the following when created:

Attributes for keys

Child logical operator (i.e. LogicalPlan )

Flag whether the logical operator is for streaming (enabled) or batch (disabled) mode


MemoryPlan Logical Operator

MemoryPlan is a leaf logical operator (i.e. LogicalPlan ) that is used to query the data that

has been written into a MemorySink. MemoryPlan is created when starting continuous writing
(to a MemorySink ).

Tip See the example in MemoryStream.

scala> intsOut.explain(true)
== Parsed Logical Plan ==
SubqueryAlias memstream
+- MemoryPlan org.apache.spark.sql.execution.streaming.MemorySink@481bf251, [value#21]

== Analyzed Logical Plan ==


value: int
SubqueryAlias memstream
+- MemoryPlan org.apache.spark.sql.execution.streaming.MemorySink@481bf251, [value#21]

== Optimized Logical Plan ==


MemoryPlan org.apache.spark.sql.execution.streaming.MemorySink@481bf251, [value#21]

== Physical Plan ==
LocalTableScan [value#21]

When executed, MemoryPlan is translated to LocalTableScanExec physical operator (similar


to LocalRelation logical operator) in BasicOperators execution planning strategy.


StreamingRelation Leaf Logical Operator for Streaming Source
StreamingRelation is a leaf logical operator (i.e. LogicalPlan ) that represents a streaming

source in a logical plan.

StreamingRelation is created when DataStreamReader is requested to load data from a

streaming source and creates a streaming Dataset .

Figure 1. StreamingRelation Represents Streaming Source

val rate = spark.


readStream. // <-- creates a DataStreamReader
format("rate").
load("hello") // <-- creates a StreamingRelation
scala> println(rate.queryExecution.logical.numberedTreeString)
00 StreamingRelation DataSource(org.apache.spark.sql.SparkSession@4e5dcc50,rate,List(),
None,List(),None,Map(path -> hello),None), rate, [timestamp#0, value#1L]

isStreaming flag is always enabled (i.e. true ).

import org.apache.spark.sql.execution.streaming.StreamingRelation
val relation = rate.queryExecution.logical.asInstanceOf[StreamingRelation]
scala> relation.isStreaming
res1: Boolean = true

toString gives the source name.

scala> println(relation)
rate

Note: StreamingRelation is resolved (aka planned) to StreamingExecutionRelation (right after StreamExecution starts running batches).

Creating StreamingRelation for DataSource — apply Object Method

apply(dataSource: DataSource): StreamingRelation


apply creates a StreamingRelation for the given DataSource (that represents a streaming

source).

Note: apply is used exclusively when DataStreamReader is requested for a streaming DataFrame.

Creating StreamingRelation Instance


StreamingRelation takes the following when created:

DataSource

Short name of the streaming source

Output attributes of the schema of the streaming source


StreamingRelationV2 Leaf Logical Operator


StreamingRelationV2 is a MultiInstanceRelation leaf logical operator that represents

MicroBatchReadSupport or ContinuousReadSupport streaming data sources in a logical


plan of a streaming query.

Tip Read up on Leaf logical operators in The Internals of Spark SQL book.

StreamingRelationV2 is created when:

DataStreamReader is requested to "load" data as a streaming DataFrame for

MicroBatchReadSupport and ContinuousReadSupport streaming data sources

ContinuousMemoryStream is created

isStreaming flag is always enabled (i.e. true ). In the following snippet, sq is assumed to be a streaming DataFrame loaded from one of the above data sources.

scala> :type sq
org.apache.spark.sql.DataFrame

import org.apache.spark.sql.execution.streaming.StreamingRelationV2
val relation = sq.queryExecution.logical.asInstanceOf[StreamingRelationV2]
assert(relation.isStreaming)

StreamingRelationV2 is resolved (replaced) to the following leaf logical operators:

ContinuousExecutionRelation when ContinuousExecution stream execution engine is


requested for the analyzed logical plan

StreamingExecutionRelation when MicroBatchExecution stream execution engine is


requested for the analyzed logical plan

Creating StreamingRelationV2 Instance


StreamingRelationV2 takes the following to be created:

DataSourceV2

Name of the data source

Options ( Map[String, String] )

Output attributes ( Seq[Attribute] )

Optional StreamingRelation


SparkSession


StreamingExecutionRelation Leaf Logical Operator for Streaming Source At Execution
StreamingExecutionRelation is a leaf logical operator (i.e. LogicalPlan ) that represents a

streaming source in the logical query plan of a streaming Dataset .

The main use of StreamingExecutionRelation logical operator is to be a "placeholder" in a


logical query plan that will be replaced with the real relation (with new data that has arrived
since the last batch) or an empty LocalRelation when StreamExecution is requested to
transforming logical plan to include the Sources and MicroBatchReaders with new data.

StreamingExecutionRelation is created for a StreamingRelation in analyzed logical query

plan (that is the execution representation of a streaming Dataset).

Note: Right after StreamExecution has started running streaming batches it initializes the streaming sources by transforming the analyzed logical plan of the streaming Dataset so that every StreamingRelation logical operator is replaced by the corresponding StreamingExecutionRelation.

Figure 1. StreamingExecutionRelation Represents Streaming Source At Execution


Note: StreamingExecutionRelation is also resolved (aka planned) to a StreamingRelationExec physical operator in StreamingRelationStrategy execution planning strategy only when explaining a streaming Dataset.

Creating StreamingExecutionRelation Instance


StreamingExecutionRelation takes the following when created:


Streaming source

Output attributes

Creating StreamingExecutionRelation (based on a Source) — apply Object Method

apply(source: Source): StreamingExecutionRelation

apply creates a StreamingExecutionRelation for the input source and with the attributes

of the schema of the source .

Note apply seems to be used for tests only.


EventTimeWatermarkExec Unary Physical Operator
EventTimeWatermarkExec is a unary physical operator that represents EventTimeWatermark

logical operator at execution time.

A unary physical operator ( UnaryExecNode ) is a physical operator with a single


child physical operator.
Note
Read up on UnaryExecNode (and physical operators in general) in The
Internals of Spark SQL book.

The purpose of the EventTimeWatermarkExec operator is to simply extract (project) the values
of the event-time watermark column and add them directly to the EventTimeStatsAccum
internal accumulator.

Note: Since the execution (data processing) happens on Spark executors, the only way to establish communication between the tasks (on the executors) and the driver is to use an accumulator. Read up on Accumulators in The Internals of Apache Spark book.

EventTimeWatermarkExec uses EventTimeStatsAccum internal accumulator as a way to send

the statistics (the maximum, minimum, average and update count) of the values in the event-
time watermark column that is later used in:

ProgressReporter for creating execution statistics for the most recent query execution

(for monitoring the max , min , avg , and watermark event-time watermark statistics)

StreamExecution to observe and possibly update event-time watermark when

constructing the next streaming batch.
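The accumulated statistics surface in the streaming query progress. A quick way to see them (a sketch, assuming sq is a started StreamingQuery with a watermark defined and at least one completed batch):

// keys include min, max, avg and watermark
println(sq.lastProgress.eventTime)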

EventTimeWatermarkExec is created exclusively when StatefulAggregationStrategy execution

planning strategy is requested to plan a logical plan with EventTimeWatermark logical


operators for execution.

Tip: Check out Demo: Streaming Watermark with Aggregation in Append Output Mode to deep dive into the internals of Streaming Watermark.

Creating EventTimeWatermarkExec Instance


EventTimeWatermarkExec takes the following to be created:


Event time column - the column with the (event) time for event-time watermark

Delay interval ( CalendarInterval )

Child physical operator ( SparkPlan )

While being created, EventTimeWatermarkExec registers the EventTimeStatsAccum internal


accumulator (with the current SparkContext ).

Executing Physical Operator (Generating RDD[InternalRow]) — doExecute Method

doExecute(): RDD[InternalRow]

Note: doExecute is part of the SparkPlan Contract to generate the runtime representation of a physical operator as a distributed computation over internal binary rows on Apache Spark (i.e. RDD[InternalRow] ).

Internally, doExecute executes the child physical operator and maps over the partitions
(using RDD.mapPartitions ).

doExecute creates an unsafe projection (one per partition) for the column with the event

time in the output schema of the child physical operator. The unsafe projection is to extract
event times from the (stream of) internal rows of the child physical operator.

For every row ( InternalRow ) per partition, doExecute requests the eventTimeStats
accumulator to add the event time.

Note The event time value is in seconds (not millis as the value is divided by 1000 ).

Output Attributes (Schema) —  output Property

output: Seq[Attribute]

Note: output is part of the QueryPlan Contract to describe the attributes of (the schema of) the output.

output requests the child physical operator for the output attributes to find the event time

column and any other column with metadata that contains spark.watermarkDelayMs key.

For the event time column, output updates the metadata to include the delay interval for
the spark.watermarkDelayMs key.


For any other column (not the event time column) with the spark.watermarkDelayMs key,
output simply removes the key from the metadata.

// FIXME: Would be nice to have a demo. Anyone?

Internal Properties

delayMs
Delay interval - the delay interval in milliseconds
Used when:

EventTimeWatermarkExec is requested for the output attributes

WatermarkTracker is requested to update the event-time watermark

eventTimeStats
EventTimeStatsAccum accumulator to accumulate eventTime values from every row in a streaming batch (when EventTimeWatermarkExec is executed).
Note: EventTimeStatsAccum is a Spark accumulator of EventTimeStats from Longs (i.e. AccumulatorV2[Long, EventTimeStats] ).
Note: Every Spark accumulator has to be registered before use, and eventTimeStats is registered when EventTimeWatermarkExec is created.


FlatMapGroupsWithStateExec Unary Physical Operator
FlatMapGroupsWithStateExec is a unary physical operator that represents

FlatMapGroupsWithState logical operator at execution time.

A unary physical operator ( UnaryExecNode ) is a physical operator with a single


child physical operator.
Note
Read up on UnaryExecNode (and physical operators in general) in The
Internals of Spark SQL book.

Note: FlatMapGroupsWithState unary logical operator represents KeyValueGroupedDataset.mapGroupsWithState and KeyValueGroupedDataset.flatMapGroupsWithState operators in a logical query plan.

FlatMapGroupsWithStateExec is created exclusively when FlatMapGroupsWithStateStrategy

execution planning strategy is requested to plan a FlatMapGroupsWithState logical operator


for execution.

FlatMapGroupsWithStateExec is an ObjectProducerExec physical operator and so produces a

single output object.

Tip: Read up on ObjectProducerExec — Physical Operators With Single Object Output in The Internals of Spark SQL book.

Tip Check out Demo: Internals of FlatMapGroupsWithStateExec Physical Operator.

Note: FlatMapGroupsWithStateExec is given an OutputMode when created, but it does not seem to be used at all. Check out the question What’s the purpose of OutputMode in flatMapGroupsWithState? How/where is it used? on StackOverflow.

Tip: Enable ALL logging level for org.apache.spark.sql.execution.streaming.FlatMapGroupsWithStateExec to see what happens inside.
Add the following line to conf/log4j.properties :

log4j.logger.org.apache.spark.sql.execution.streaming.FlatMapGroupsWithStateExec=ALL

Refer to Logging.


Creating FlatMapGroupsWithStateExec Instance


FlatMapGroupsWithStateExec takes the following to be created:

User-defined state function that is applied to every group (of type (Any,
Iterator[Any], LogicalGroupState[Any]) ⇒ Iterator[Any] )

Key deserializer expression

Value deserializer expression

Grouping attributes (as used for grouping in KeyValueGroupedDataset for


mapGroupsWithState or flatMapGroupsWithState operators)

Data attributes

Output object attribute (that is the reference to the single object field this operator
outputs)

StatefulOperatorStateInfo

State encoder ( ExpressionEncoder[Any] )

State format version

OutputMode

GroupStateTimeout

Batch Processing Time

Event-time watermark

Child physical operator

FlatMapGroupsWithStateExec initializes the internal properties.

Performance Metrics (SQLMetrics)


FlatMapGroupsWithStateExec uses the performance metrics of StateStoreWriter.


Figure 1. FlatMapGroupsWithStateExec in web UI (Details for Query)

FlatMapGroupsWithStateExec as StateStoreWriter
FlatMapGroupsWithStateExec is a stateful physical operator that can write to a state store (and MicroBatchExecution requests whether to run another batch or not based on the GroupStateTimeout).

FlatMapGroupsWithStateExec uses the GroupStateTimeout (and possibly the updated

metadata) when asked whether to run another batch or not (when MicroBatchExecution is
requested to construct the next streaming micro-batch when requested to run the activated
streaming query).

FlatMapGroupsWithStateExec with Streaming Event-Time Watermark Support (WatermarkSupport)
FlatMapGroupsWithStateExec is a physical operator that supports streaming event-time

watermark.


FlatMapGroupsWithStateExec is given the optional event time watermark when created.

The event-time watermark is initially undefined ( None ) when planned for execution (in FlatMapGroupsWithStateStrategy execution planning strategy).

Note: FlatMapGroupsWithStateStrategy converts FlatMapGroupsWithState unary logical operator to FlatMapGroupsWithStateExec physical operator with undefined StatefulOperatorStateInfo, batchTimestampMs, and eventTimeWatermark.

The event-time watermark (with the StatefulOperatorStateInfo and the batchTimestampMs)


is only defined to the current event-time watermark of the given OffsetSeqMetadata when
IncrementalExecution query execution pipeline is requested to apply the state preparation

rule (as part of the preparations rules).

Note: The preparations rules are executed (applied to a physical query plan) at the executedPlan phase of Structured Query Execution Pipeline to generate an optimized physical query plan ready for execution. Read up on Structured Query Execution Pipeline in The Internals of Spark SQL book.

IncrementalExecution is used as the lastExecution of the available streaming query

execution engines. It is created in the queryPlanning phase (of the MicroBatchExecution


and ContinuousExecution execution engines) based on the current OffsetSeqMetadata.

Note: The optional event-time watermark can only be defined when the state preparation rule is executed, which is at the executedPlan phase of Structured Query Execution Pipeline, which is also part of the queryPlanning phase.

FlatMapGroupsWithStateExec and StateManager — stateManager Property

stateManager: StateManager

While being created, FlatMapGroupsWithStateExec creates a StateManager (with the state


encoder and the isTimeoutEnabled flag).

A StateManager is created per state format version that is given while creating a
FlatMapGroupsWithStateExec (to choose between the available implementations).

The state format version is controlled by


spark.sql.streaming.flatMapGroupsWithState.stateFormatVersion internal configuration
property (default: 2 ).


Note StateManagerImplV2 is the default StateManager .

The StateManager is used exclusively when FlatMapGroupsWithStateExec physical operator


is executed (to generate a recipe for a distributed computation as an RDD[InternalRow] ) for
the following:

State schema (for the value schema of a StateStoreRDD)

State data for a key in a StateStore while processing new data

All state data (for all keys) in a StateStore while processing timed-out state data

Removing the state for a key from a StateStore when all rows have been processed

Persisting the state for a key in a StateStore when all rows have been processed

keyExpressions Method

keyExpressions: Seq[Attribute]

Note keyExpressions is part of the WatermarkSupport Contract to…​FIXME.

keyExpressions simply returns the grouping attributes.

Executing Physical Operator (Generating RDD[InternalRow]) — doExecute Method

doExecute(): RDD[InternalRow]

Note: doExecute is part of the SparkPlan Contract to generate the runtime representation of a physical operator as a distributed computation over internal binary rows on Apache Spark (i.e. RDD[InternalRow] ).

doExecute first initializes the metrics (which happens on the driver).

doExecute then requests the child physical operator to execute and generate an

RDD[InternalRow] .

doExecute uses StateStoreOps to create a StateStoreRDD with a storeUpdateFunction that

does the following (for a partition):

1. Creates an InputProcessor for a given StateStore


2. (only when the GroupStateTimeout is EventTimeTimeout) Filters out late data based on
the event-time watermark, i.e. rows from a given Iterator[InternalRow] that are older
than the event-time watermark are excluded from the steps that follow

3. Requests the InputProcessor to create an iterator of a new data processed from the
(possibly filtered) iterator

4. Requests the InputProcessor to create an iterator of a timed-out state data

5. Creates an iterator by concatenating the above iterators (with the new data processed
first)

6. In the end, creates a CompletionIterator that executes a completion function


( completionFunction ) after it has successfully iterated through all the elements (i.e.
when a client has consumed all the rows). The completion method requests the given
StateStore to commit changes followed by setting the store-specific metrics.

Checking Out Whether Last Batch Execution Requires Another Non-Data Batch or Not — shouldRunAnotherBatch Method

shouldRunAnotherBatch(newMetadata: OffsetSeqMetadata): Boolean

Note: shouldRunAnotherBatch is part of the StateStoreWriter Contract to indicate whether MicroBatchExecution should run another non-data batch (based on the updated OffsetSeqMetadata with the current event-time watermark and the batch timestamp).

shouldRunAnotherBatch uses the GroupStateTimeout as follows:

With EventTimeTimeout, shouldRunAnotherBatch is positive ( true ) only when the


event-time watermark is defined and is older (below) the event-time watermark of the
given OffsetSeqMetadata

With NoTimeout (and other GroupStateTimeouts if there were any),


shouldRunAnotherBatch is always negative ( false )

With ProcessingTimeTimeout, shouldRunAnotherBatch is always positive ( true )

Internal Properties


isTimeoutEnabled
Flag that says whether the GroupStateTimeout is not NoTimeout
Used when:

FlatMapGroupsWithStateExec is created (and creates the internal StateManager)

InputProcessor is requested to processTimedOutState

stateAttributes

stateDeserializer

stateSerializer

timestampTimeoutAttribute

watermarkPresent
Flag that says whether the child physical operator has a watermark attribute (among the output attributes).
Used exclusively when InputProcessor is requested to callFunctionAndUpdateState


StateStoreRestoreExec Unary Physical Operator — Restoring Streaming State From State Store
StateStoreRestoreExec is a unary physical operator that restores (reads) a streaming state

from a state store (for the keys from the child physical operator).

A unary physical operator ( UnaryExecNode ) is a physical operator with a single


child physical operator.
Note
Read up on UnaryExecNode (and physical operators in general) in The
Internals of Spark SQL book.

StateStoreRestoreExec is created exclusively when StatefulAggregationStrategy execution

planning strategy is requested to plan a streaming aggregation for execution ( Aggregate


logical operators in the logical plan of a streaming query).

Figure 1. StateStoreRestoreExec and StatefulAggregationStrategy


The optional StatefulOperatorStateInfo is initially undefined (i.e. when
StateStoreRestoreExec is created). StateStoreRestoreExec is updated to hold the streaming

batch-specific execution property when IncrementalExecution prepares a streaming


physical plan for execution (and state preparation rule is executed when StreamExecution
plans a streaming query for a streaming batch).

Figure 2. StateStoreRestoreExec and IncrementalExecution


When executed, StateStoreRestoreExec executes the child physical operator and creates a
StateStoreRDD to map over partitions with storeUpdateFunction that restores the state for
the keys in the input rows if available.

The output schema of StateStoreRestoreExec is exactly the child's output schema.

The output partitioning of StateStoreRestoreExec is exactly the child's output partitioning.

Performance Metrics (SQLMetrics)


numOutputRows (number of output rows)
The number of input rows from the child physical operator (for which StateStoreRestoreExec tried to find the state)


Figure 3. StateStoreRestoreExec in web UI (Details for Query)

Creating StateStoreRestoreExec Instance


StateStoreRestoreExec takes the following to be created:

Key expressions, i.e. Catalyst attributes for the grouping keys

Optional StatefulOperatorStateInfo (default: None )

Version of the state format (based on the


spark.sql.streaming.aggregation.stateFormatVersion configuration property)

Child physical operator ( SparkPlan )

StateStoreRestoreExec and
StreamingAggregationStateManager —  stateManager
Property

stateManager: StreamingAggregationStateManager

stateManager is a StreamingAggregationStateManager that is created together with

StateStoreRestoreExec .

The StreamingAggregationStateManager is created for the keys, the output schema of the
child physical operator and the version of the state format.

The StreamingAggregationStateManager is used when StateStoreRestoreExec is requested to


generate a recipe for a distributed computation (as a RDD[InternalRow]) for the following:

Schema of the values in a state store

Extracting the columns for the key from the input row

Looking up the value of a key from a state store

Executing Physical Operator (Generating RDD[InternalRow]) — doExecute Method

doExecute(): RDD[InternalRow]


Note: doExecute is part of the SparkPlan Contract to generate the runtime representation of a physical operator as a distributed computation over internal binary rows on Apache Spark (i.e. RDD[InternalRow] ).

Internally, doExecute executes child physical operator and creates a StateStoreRDD with
storeUpdateFunction that does the following per child operator’s RDD partition:

1. Generates an unsafe projection to access the key field (using keyExpressions and the
output schema of child operator).

2. For every input row (as InternalRow )

Extracts the key from the row (using the unsafe projection above)

Gets the saved state in StateStore for the key if available (it might not be if the
key appeared in the input the first time)

Increments numOutputRows metric (that in the end is the number of rows from the
child operator)

Generates collection made up of the current row and possibly the state for the key
if available

Note: The number of rows from StateStoreRestoreExec is the number of rows from the child operator with additional rows for the saved state.

Note: There is no way in StateStoreRestoreExec to find out how many rows had associated state available in a state store. You would have to use the corresponding StateStoreSaveExec operator’s metrics (most likely number of total state rows but that could depend on the output mode).


StateStoreSaveExec Unary Physical Operator — Saving Streaming State To State Store
StateStoreSaveExec is a unary physical operator that saves a streaming state to a state

store with support for streaming watermark.

A unary physical operator ( UnaryExecNode ) is a physical operator with a single


child physical operator.
Note
Read up on UnaryExecNode (and physical operators in general) in The
Internals of Spark SQL book.

StateStoreSaveExec is created exclusively when StatefulAggregationStrategy execution

planning strategy is requested to plan a streaming aggregation for execution ( Aggregate


logical operators in the logical plan of a streaming query).

Figure 1. StateStoreSaveExec and StatefulAggregationStrategy


The optional properties, i.e. the StatefulOperatorStateInfo, the output mode, and the event-
time watermark, are initially undefined when StateStoreSaveExec is created.
StateStoreSaveExec is updated to hold execution-specific configuration when

IncrementalExecution is requested to prepare the logical plan (of a streaming query) for

execution (when the state preparation rule is executed).

Figure 2. StateStoreSaveExec and IncrementalExecution


Note: Unlike StateStoreRestoreExec operator, StateStoreSaveExec takes output mode and event time watermark when created.

When executed, StateStoreSaveExec creates a StateStoreRDD to map over partitions with


storeUpdateFunction that manages the StateStore .

Figure 3. StateStoreSaveExec creates StateStoreRDD


Figure 4. StateStoreSaveExec and StateStoreRDD (after streamingBatch.toRdd.count)


The number of partitions of StateStoreRDD (and hence the number of Spark
tasks) is what was defined for the child physical plan.
Note
There will be that many StateStores as there are partitions in StateStoreRDD .

Note StateStoreSaveExec behaves differently per output mode.

When executed, StateStoreSaveExec executes the child physical operator and creates a
StateStoreRDD (with storeUpdateFunction specific to the output mode).

The output schema of StateStoreSaveExec is exactly the child's output schema.

The output partitioning of StateStoreSaveExec is exactly the child's output partitioning.

StateStoreRestoreExec uses a StreamingAggregationStateManager (that is created for the

keyExpressions, the output of the child physical operator and the stateFormatVersion).

Tip: Enable ALL logging level for org.apache.spark.sql.execution.streaming.StateStoreSaveExec to see what happens inside.
Add the following line to conf/log4j.properties :

log4j.logger.org.apache.spark.sql.execution.streaming.StateStoreSaveExec=ALL

Refer to Logging.

Performance Metrics (SQLMetrics)


StateStoreSaveExec uses the performance metrics as other stateful physical operators that

write to a state store.

Figure 5. StateStoreSaveExec in web UI (Details for Query)


The following table shows how the performance metrics are computed (and so their exact
meaning).

total time to update rows
Time taken to read the input rows and store them in a state store (possibly filtering out expired rows per watermarkPredicateForData predicate). The number of rows stored is the number of updated state rows metric.

For Append output mode, the time taken to filter out expired rows (per the required watermarkPredicateForData predicate) and the StreamingAggregationStateManager to store rows in a state store

For Complete output mode, the time taken to go over all the input rows and request the StreamingAggregationStateManager to store rows in a state store

For Update output mode, the time taken to filter out expired rows (per the optional watermarkPredicateForData predicate) and the StreamingAggregationStateManager to store rows in a state store

total time to remove rows

For Append output mode, the time taken for the StreamingAggregationStateManager to remove all expired entries from a state store (per watermarkPredicateForKeys predicate) that is the total time of iterating over all entries in the state store (the number of entries removed from a state store is the difference between the number of output rows of the child operator and the number of total state rows metric)

For Complete output mode, always 0

For Update output mode, the time taken for the StreamingAggregationStateManager to remove all expired entries from a state store (per watermarkPredicateForKeys predicate)

time to commit changes
Time taken for the StreamingAggregationStateManager to commit changes to a state store

number of output rows

For Append output mode, the metric does not seem to be used

For Complete output mode, the number of rows in a StateStore (i.e. all values in a StateStore in the StreamingAggregationStateManager that should be equivalent to the number of total state rows metric)

For Update output mode, the number of rows that the StreamingAggregationStateManager was requested to store in a state store (that did not expire per the optional watermarkPredicateForData predicate) that is equivalent to the number of updated state rows metric

number of total state rows
Number of entries in a state store at the very end of executing the StateStoreSaveExec operator (aka numTotalStateRows)
Corresponds to numRowsTotal attribute in stateOperators in StreamingQueryProgress (and is available as sq.lastProgress.stateOperators for an operator).

number of updated state rows
Number of the entries that were stored as updates in a state store in a trigger and for the keys in the result rows of the upstream physical operator (aka numUpdatedStateRows)

For Append output mode, the number of input rows that have not expired yet (per the required watermarkPredicateForData predicate) and that the StreamingAggregationStateManager was requested to store in a state store (the time taken is the total time to update rows metric)

For Complete output mode, the number of input rows (which should be exactly the number of output rows from the child operator)

For Update output mode, the number of rows that the StreamingAggregationStateManager was requested to store in a state store (that did not expire per the optional watermarkPredicateForData predicate) that is equivalent to the number of output rows metric

Corresponds to numRowsUpdated attribute in stateOperators in StreamingQueryProgress (and is available as sq.lastProgress.stateOperators for an operator).

memory used by state
Estimated memory used by a StateStore (aka stateMemory) after StateStoreSaveExec finished execution (per the StateStoreMetrics of the StateStore)
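The numRowsTotal, numRowsUpdated and memory figures above can be read off a running query. A minimal sketch (assuming sq is a started StreamingQuery with a streaming aggregation and at least one completed micro-batch):

sq.lastProgress.stateOperators.foreach { s =>
  println(s"numRowsTotal=${s.numRowsTotal} numRowsUpdated=${s.numRowsUpdated} memoryUsedBytes=${s.memoryUsedBytes}")
}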

Creating StateStoreSaveExec Instance


StateStoreSaveExec takes the following to be created:

Key expressions, i.e. Catalyst attributes for the grouping keys

Execution-specific StatefulOperatorStateInfo (default: None )


Execution-specific output mode (default: None )

Event-time watermark (default: None )

Version of the state format (based on the


spark.sql.streaming.aggregation.stateFormatVersion configuration property)

Child physical operator ( SparkPlan )

Executing Physical Operator (Generating RDD[InternalRow]) — doExecute Method

doExecute(): RDD[InternalRow]

Note: doExecute is part of the SparkPlan Contract to generate the runtime representation of a physical operator as a distributed computation over internal binary rows on Apache Spark (i.e. RDD[InternalRow] ).

Internally, doExecute initializes metrics.

Note: doExecute requires that the optional outputMode is at this point defined (that should have happened when IncrementalExecution had prepared a streaming aggregation for execution).

doExecute executes child physical operator and creates a StateStoreRDD with

storeUpdateFunction that:

1. Generates an unsafe projection to access the key field (using keyExpressions and the
output schema of child).

2. Branches off per output mode: Append, Complete and Update.

doExecute throws an UnsupportedOperationException when executed with an invalid output

mode:

Invalid output mode: [outputMode]

Append Output Mode

Note Append is the default output mode when not specified explicitly.

Note Append output mode requires that a streaming query defines an event-time watermark (e.g. using the withWatermark operator) on the event-time column that is used in aggregation (directly or using the window standard function).


For Append output mode, doExecute does the following:

1. Finds late (aggregate) rows from child physical operator (that have expired per
watermark)

2. Stores the late rows in the state store and increments the numUpdatedStateRows
metric

3. Gets all the added (late) rows from the state store

4. Creates an iterator that removes the late rows from the state store when requested the
next row and in the end commits the state updates
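
The following is a minimal sketch (source, sink and column names are illustrative) of a windowed streaming aggregation with a watermark that can run in Append output mode and is planned with StateStoreSaveExec :

import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.OutputMode
import spark.implicits._

val windowedCounts = spark.readStream
  .format("rate")
  .load
  .withWatermark("timestamp", "10 seconds")           // required for Append output mode
  .groupBy(window($"timestamp", "5 seconds") as "window")
  .count

val sq = windowedCounts.writeStream
  .format("console")
  .outputMode(OutputMode.Append)
  .start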

Tip Refer to Demo: Streaming Watermark with Aggregation in Append Output Mode for an example of StateStoreSaveExec with Append output mode.

Caution FIXME When is "Filtering state store on:" printed out?

1. Uses watermarkPredicateForData predicate to exclude matching rows and (like in


Complete output mode) stores all the remaining rows in StateStore .

2. (like in Complete output mode) While storing the rows, increments


numUpdatedStateRows metric (for every row) and records the total time in
allUpdatesTimeMs metric.

3. Takes all the rows from StateStore and returns a NextIterator that:

In getNext , finds the first row that matches watermarkPredicateForKeys predicate,


removes it from StateStore , and returns it back.

If no row was found, getNext also marks the iterator as finished.

In close , records the time to iterate over all the rows in allRemovalsTimeMs
metric, commits the updates to StateStore followed by recording the time in
commitTimeMs metric and recording StateStore metrics.

Complete Output Mode


For Complete output mode, doExecute does the following:

1. Takes all UnsafeRow rows (from the parent iterator)

2. Stores the rows by key in the state store eagerly (i.e. all rows that are available in the
parent iterator before proceeding)


3. Commits the state updates

4. In the end, reads the key-row pairs from the state store and passes the rows along (i.e.
to the following physical operator)

The number of keys stored in the state store is recorded in numUpdatedStateRows metric.

Note In Complete output mode, the numOutputRows metric is exactly the numTotalStateRows metric.

Tip Refer to Demo: StateStoreSaveExec with Complete Output Mode for an example of StateStoreSaveExec with Complete output mode.

1. Stores all rows (as UnsafeRow ) in StateStore .

2. While storing the rows, increments numUpdatedStateRows metric (for every row) and
records the total time in allUpdatesTimeMs metric.

3. Records 0 in allRemovalsTimeMs metric.

4. Commits the state updates to StateStore and records the time in commitTimeMs
metric.

5. Records StateStore metrics.

6. In the end, takes all the rows stored in StateStore and increments numOutputRows
metric.
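
As a minimal sketch (source and sink are illustrative), an aggregation without a watermark can be run in Complete output mode, which keeps all groups in the state store:

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.streaming.OutputMode

val totals = spark.readStream
  .format("rate")
  .load
  .groupBy(col("value") % 10 as "bucket")
  .count

val sq = totals.writeStream
  .format("console")
  .outputMode(OutputMode.Complete)
  .start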

Update Output Mode


For Update output mode, doExecute returns an iterator that filters out late aggregate rows
(per watermark if defined) and stores the "young" rows in the state store (one by one, i.e.
every next ).

When no more rows are available, the iterator removes the late rows from the state store (all at once) and commits the state updates.

Tip Refer to Demo: StateStoreSaveExec with Update Output Mode for an example of StateStoreSaveExec with Update output mode.

doExecute returns Iterator of rows that uses watermarkPredicateForData predicate to

filter out late rows.


In hasNext , when rows are no longer available:

1. Records the total time to iterate over all the rows in allUpdatesTimeMs metric.

2. removeKeysOlderThanWatermark and records the time in allRemovalsTimeMs metric.

3. Commits the updates to StateStore and records the time in commitTimeMs metric.

4. Records StateStore metrics.

In next , stores a row in StateStore and increments numOutputRows and


numUpdatedStateRows metrics.

Checking Out Whether Last Batch Execution Requires


Another Non-Data Batch or Not 
—  shouldRunAnotherBatch Method

shouldRunAnotherBatch(
newMetadata: OffsetSeqMetadata): Boolean

Note shouldRunAnotherBatch is part of the StateStoreWriter contract to indicate whether MicroBatchExecution should run another non-data batch (based on the updated OffsetSeqMetadata with the current event-time watermark and the batch timestamp).

shouldRunAnotherBatch is positive ( true ) when all of the following are met:

Output mode is either Append or Update

Event-time watermark is defined and is below (older than) the current event-time watermark (of the given OffsetSeqMetadata )

Otherwise, shouldRunAnotherBatch is negative ( false ).
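
The decision can be sketched as follows (a simplified illustration, not the actual implementation; outputMode and eventTimeWatermark are the operator's optional properties, and newWatermarkMs stands for the event-time watermark of the given OffsetSeqMetadata ):

import org.apache.spark.sql.streaming.OutputMode

def shouldRunAnotherBatchSketch(
    outputMode: Option[OutputMode],
    eventTimeWatermark: Option[Long],
    newWatermarkMs: Long): Boolean = {
  val appendOrUpdate =
    outputMode.contains(OutputMode.Append) || outputMode.contains(OutputMode.Update)
  // the operator's watermark has to be strictly below the new watermark
  appendOrUpdate && eventTimeWatermark.exists(_ < newWatermarkMs)
}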


StreamingDeduplicateExec Unary Physical


Operator for Streaming Deduplication
StreamingDeduplicateExec is a unary physical operator that writes state to StateStore with

support for streaming watermark.

A unary physical operator ( UnaryExecNode ) is a physical operator with a single


child physical operator.
Note
Read up on UnaryExecNode (and physical operators in general) in The
Internals of Spark SQL book.

StreamingDeduplicateExec is created exclusively when StreamingDeduplicationStrategy

plans Deduplicate unary logical operators.

Figure 1. StreamingDeduplicateExec and StreamingDeduplicationStrategy

val uniqueValues = spark.


readStream.
format("rate").
load.
dropDuplicates("value") // <-- creates Deduplicate logical operator

scala> println(uniqueValues.queryExecution.logical.numberedTreeString)
00 Deduplicate [value#214L], true
01 +- StreamingRelation DataSource(org.apache.spark.sql.SparkSession@4785f176,rate,List
(),None,List(),None,Map(),None), rate, [timestamp#213, value#214L]

scala> uniqueValues.explain
== Physical Plan ==
StreamingDeduplicate [value#214L], StatefulOperatorStateInfo(<unknown>,5a65879c-67bc-4
e77-b417-6100db6a52a2,0,0), 0
+- Exchange hashpartitioning(value#214L, 200)
+- StreamingRelation rate, [timestamp#213, value#214L]

// Start the query and hence StreamingDeduplicateExec


import scala.concurrent.duration._
import org.apache.spark.sql.streaming.{OutputMode, Trigger}
val sq = uniqueValues.


writeStream.
format("console").
option("truncate", false).
trigger(Trigger.ProcessingTime(10.seconds)).
outputMode(OutputMode.Update).
start

// sorting not supported for non-aggregate queries


// and so values are unsorted

-------------------------------------------
Batch: 0
-------------------------------------------
+---------+-----+
|timestamp|value|
+---------+-----+
+---------+-----+

-------------------------------------------
Batch: 1
-------------------------------------------
+-----------------------+-----+
|timestamp |value|
+-----------------------+-----+
|2017-07-25 22:12:03.018|0 |
|2017-07-25 22:12:08.018|5 |
|2017-07-25 22:12:04.018|1 |
|2017-07-25 22:12:06.018|3 |
|2017-07-25 22:12:05.018|2 |
|2017-07-25 22:12:07.018|4 |
+-----------------------+-----+

-------------------------------------------
Batch: 2
-------------------------------------------
+-----------------------+-----+
|timestamp |value|
+-----------------------+-----+
|2017-07-25 22:12:10.018|7 |
|2017-07-25 22:12:09.018|6 |
|2017-07-25 22:12:12.018|9 |
|2017-07-25 22:12:13.018|10 |
|2017-07-25 22:12:15.018|12 |
|2017-07-25 22:12:11.018|8 |
|2017-07-25 22:12:14.018|11 |
|2017-07-25 22:12:16.018|13 |
|2017-07-25 22:12:17.018|14 |
|2017-07-25 22:12:18.018|15 |
+-----------------------+-----+

// Eventually...
sq.stop


StreamingDeduplicateExec uses the performance metrics of StateStoreWriter.

Figure 2. StreamingDeduplicateExec in web UI (Details for Query)


The output schema of StreamingDeduplicateExec is exactly the child's output schema.

The output partitioning of StreamingDeduplicateExec is exactly the child's output partitioning.

/**
// Start spark-shell with debugging and Kafka support

SPARK_SUBMIT_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005" \
./bin/spark-shell \
--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.0-SNAPSHOT
*/
// Reading
val topic1 = spark.
readStream.
format("kafka").
option("subscribe", "topic1").
option("kafka.bootstrap.servers", "localhost:9092").
option("startingOffsets", "earliest").
load

// Processing with deduplication


// Don't use watermark
// The following won't work due to https://issues.apache.org/jira/browse/SPARK-21546
/**
val records = topic1.
withColumn("eventtime", 'timestamp). // <-- just to put the right name given the pu
rpose
withWatermark(eventTime = "eventtime", delayThreshold = "30 seconds"). // <-- use th
e renamed eventtime column
dropDuplicates("value"). // dropDuplicates will use watermark
// only when eventTime column exists
// include the watermark column => internal design leak?
select('key cast "string", 'value cast "string", 'eventtime).
as[(String, String, java.sql.Timestamp)]
*/

val records = topic1.


dropDuplicates("value").
select('key cast "string", 'value cast "string").
as[(String, String)]

scala> records.explain
== Physical Plan ==
*Project [cast(key#0 as string) AS key#249, cast(value#1 as string) AS value#250]
+- StreamingDeduplicate [value#1], StatefulOperatorStateInfo(<unknown>,68198b93-6184-49
ae-8098-006c32cc6192,0,0), 0
+- Exchange hashpartitioning(value#1, 200)
+- *Project [key#0, value#1]
+- StreamingRelation kafka, [key#0, value#1, topic#2, partition#3, offset#4L,
timestamp#5, timestampType#6]

// Writing
import org.apache.spark.sql.streaming.{OutputMode, Trigger}
import scala.concurrent.duration._
val sq = records.
writeStream.
format("console").
option("truncate", false).
trigger(Trigger.ProcessingTime(10.seconds)).


queryName("from-kafka-topic1-to-console").
outputMode(OutputMode.Update).
start

// Eventually...
sq.stop

Enable INFO logging level for


org.apache.spark.sql.execution.streaming.StreamingDeduplicateExec to see what
happens inside.
Add the following line to conf/log4j.properties :
Tip
log4j.logger.org.apache.spark.sql.execution.streaming.StreamingDeduplicateExec=INFO

Refer to Logging.

Executing Physical Operator (Generating


RDD[InternalRow]) —  doExecute Method

doExecute(): RDD[InternalRow]

Note doExecute is part of the SparkPlan contract to generate the runtime representation of a physical operator as a distributed computation over internal binary rows on Apache Spark (i.e. RDD[InternalRow] ).

Internally, doExecute initializes metrics.

doExecute executes child physical operator and creates a StateStoreRDD with

storeUpdateFunction that:

1. Generates an unsafe projection to access the key field (using keyExpressions and the
output schema of child).

2. Filters out rows from Iterator[InternalRow] that match watermarkPredicateForData


(when defined and timeoutConf is EventTimeTimeout )

3. For every row (as InternalRow )

Extracts the key from the row (using the unsafe projection above)

Gets the saved state in StateStore for the key

(when there was a state for the key in the row) Filters out (aka drops) the row


(when there was no state for the key in the row) Stores a new (and empty) state for
the key and increments numUpdatedStateRows and numOutputRows metrics.

4. In the end, storeUpdateFunction creates a CompletionIterator that executes a


completion function (aka completionFunction ) after it has successfully iterated through
all the elements (i.e. when a client has consumed all the rows).

The completion function does the following:

Updates allUpdatesTimeMs metric (that is the total time to execute


storeUpdateFunction )

Updates allRemovalsTimeMs metric with the time taken to remove keys older than
the watermark from the StateStore

Updates commitTimeMs metric with the time taken to commit the changes to the
StateStore

Sets StateStore-specific metrics
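
The essence of the per-row logic above is "keep only the first row per key and remember the key in the state store". A minimal, self-contained sketch of that idea (the state store is simulated here with a mutable Set; the names are illustrative, not Spark internals):

import scala.collection.mutable

def dedupByKey[T, K](rows: Iterator[T])(key: T => K): Iterator[T] = {
  val seen = mutable.Set.empty[K]   // stands in for the keys kept in the StateStore
  rows.filter { row =>
    val k = key(row)
    val isNew = !seen.contains(k)
    if (isNew) seen += k            // store an (empty) state for a new key
    isNew                           // drop rows whose key already has state
  }
}

// keeps the first row per the second tuple element
val out = dedupByKey(Iterator(("a", 1), ("b", 1), ("c", 2)))(_._2).toList
// out: List[(String, Int)] = List((a,1), (c,2))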

Creating StreamingDeduplicateExec Instance


StreamingDeduplicateExec takes the following when created:

Duplicate keys (as used in dropDuplicates operator)

Child physical operator ( SparkPlan )

StatefulOperatorStateInfo

Event-time watermark

Checking Out Whether Last Batch Execution Requires


Another Non-Data Batch or Not 
—  shouldRunAnotherBatch Method

shouldRunAnotherBatch(newMetadata: OffsetSeqMetadata): Boolean

Note shouldRunAnotherBatch is part of the StateStoreWriter contract to indicate whether MicroBatchExecution should run another non-data batch (based on the updated OffsetSeqMetadata with the current event-time watermark and the batch timestamp).

shouldRunAnotherBatch …​FIXME


StreamingGlobalLimitExec Unary Physical


Operator
StreamingGlobalLimitExec is a unary physical operator that represents a Limit logical

operator of a streaming query at execution time.

A unary physical operator ( UnaryExecNode ) is a physical operator with a single


child physical operator.
Note
Read up on UnaryExecNode (and physical operators in general) in The
Internals of Spark SQL book.

StreamingGlobalLimitExec is created exclusively when StreamingGlobalLimitStrategy

execution planning strategy is requested to plan a Limit logical operator (in the logical plan
of a streaming query) for execution.

Limit logical operator represents Dataset.limit operator in a logical query


plan.
Note
Read up on Limit Logical Operator in The Internals of Spark SQL book.

StreamingGlobalLimitExec is a stateful physical operator that can write to a state store.

StreamingGlobalLimitExec supports Append output mode only.

The optional properties, i.e. the StatefulOperatorStateInfo and the output mode, are initially
undefined when StreamingGlobalLimitExec is created. StreamingGlobalLimitExec is updated
to hold execution-specific configuration when IncrementalExecution is requested to prepare
the logical plan (of a streaming query) for execution (when the state preparation rule is
executed).
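
A minimal sketch (source and sink are illustrative) of a streaming limit that this operator executes; per the above, Append is the only supported output mode:

val limited = spark.readStream
  .format("rate")
  .load
  .limit(5)

val sq = limited.writeStream
  .format("console")
  .outputMode("append")
  .start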

Creating StreamingGlobalLimitExec Instance


StreamingGlobalLimitExec takes the following to be created:

Streaming Limit

Child physical operator ( SparkPlan )

StatefulOperatorStateInfo (default: None )

OutputMode (default: None )

StreamingGlobalLimitExec initializes the internal properties.


StreamingGlobalLimitExec as StateStoreWriter
StreamingGlobalLimitExec is a stateful physical operator that can write to a state store.

Performance Metrics
StreamingGlobalLimitExec uses the performance metrics of the parent StateStoreWriter.

Executing Physical Operator (Generating


RDD[InternalRow]) —  doExecute Method

doExecute(): RDD[InternalRow]

Note doExecute is part of the SparkPlan contract to generate the runtime representation of a physical operator as a recipe for distributed computation over internal binary rows on Apache Spark ( RDD[InternalRow] ).

doExecute …​FIXME

Internal Properties

Name Description

keySchema
FIXME
Used when…​FIXME

valueSchema
FIXME
Used when…​FIXME


StreamingRelationExec Leaf Physical Operator


StreamingRelationExec is a leaf physical operator (i.e. LeafExecNode ) that…​FIXME

StreamingRelationExec is created when StreamingRelationStrategy plans

StreamingRelation and StreamingExecutionRelation logical operators.

scala> spark.version
res0: String = 2.3.0-SNAPSHOT

val rates = spark.


readStream.
format("rate").
load

// StreamingRelation logical operator


scala> println(rates.queryExecution.logical.numberedTreeString)
00 StreamingRelation DataSource(org.apache.spark.sql.SparkSession@31ba0af0,rate,List(),
None,List(),None,Map(),None), rate, [timestamp#0, value#1L]

// StreamingRelationExec physical operator (shown without "Exec" suffix)


scala> rates.explain
== Physical Plan ==
StreamingRelation rate, [timestamp#0, value#1L]

StreamingRelationExec is not supposed to be executed and is used…​FIXME

Creating StreamingRelationExec Instance


StreamingRelationExec takes the following when created:

The name of a streaming data source

Output attributes


StreamingSymmetricHashJoinExec Binary
Physical Operator — Stream-Stream Joins
StreamingSymmetricHashJoinExec is a binary physical operator that represents a stream-

stream equi-join at execution time.

A binary physical operator ( BinaryExecNode ) is a physical operator with left and


right child physical operators.
Note
Read up on BinaryExecNode (and physical operators in general) in The
Internals of Spark SQL online book.

StreamingSymmetricHashJoinExec supports Inner , LeftOuter , and RightOuter join types

(with the left and the right keys using the exact same data types).

StreamingSymmetricHashJoinExec is created exclusively when StreamingJoinStrategy

execution planning strategy is requested to plan a logical query plan with a Join logical
operator of two streaming queries with equality predicates ( EqualTo and EqualNullSafe ).

StreamingSymmetricHashJoinExec is given execution-specific configuration (i.e.

StatefulOperatorStateInfo, event-time watermark, and JoinStateWatermarkPredicates) when


IncrementalExecution is requested to plan a streaming query for execution (and uses the

state preparation rule).

StreamingSymmetricHashJoinExec uses two OneSideHashJoiners (for the left and right sides

of the join) to manage join state when processing partitions of the left and right sides of a
stream-stream join.

StreamingSymmetricHashJoinExec is a stateful physical operator that writes to a state store.
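
A minimal sketch (column names and sources are illustrative) of a stream-stream inner equi-join that is planned with StreamingSymmetricHashJoinExec :

import spark.implicits._

val left = spark.readStream.format("rate").load
  .select($"value" as "id", $"timestamp" as "leftTime")
  .withWatermark("leftTime", "10 seconds")

val right = spark.readStream.format("rate").load
  .select($"value" as "id", $"timestamp" as "rightTime")
  .withWatermark("rightTime", "10 seconds")

// the equality predicate on id makes this a stream-stream equi-join
val joined = left.join(right, Seq("id"), "inner")

joined.explain
// The physical plan should include StreamingSymmetricHashJoin (shown without the "Exec" suffix)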

Creating StreamingSymmetricHashJoinExec Instance


StreamingSymmetricHashJoinExec takes the following to be created:

Left keys (Catalyst expressions of the keys on the left side)

Right keys (Catalyst expressions of the keys on the right side)

Join type

Join condition ( JoinConditionSplitPredicates )

StatefulOperatorStateInfo


Event-Time Watermark

Watermark Predicates for State Removal

Physical operator on the left side ( SparkPlan )

Physical operator on the right side ( SparkPlan )

StreamingSymmetricHashJoinExec initializes the internal properties.

Output Schema —  output Method

output: Seq[Attribute]

Note output is part of the QueryPlan contract to describe the attributes of (the schema of) the output.

output schema depends on the join type:

For Cross and Inner ( InnerLike ) joins, it is the output schema of the left and right
operators

For LeftOuter joins, it is the output schema of the left operator with the attributes of the
right operator with nullability flag enabled ( true )

For RightOuter joins, it is the output schema of the right operator with the attributes of
the left operator with nullability flag enabled ( true )

output throws an IllegalArgumentException for other join types:

[className] should not take [joinType] as the JoinType

Output Partitioning —  outputPartitioning Method

outputPartitioning: Partitioning

outputPartitioning is part of the SparkPlan Contract to specify how data


Note
should be partitioned across different nodes in the cluster.

outputPartitioning depends on the join type:

For Cross and Inner ( InnerLike ) joins, it is a PartitioningCollection of the output


partitioning of the left and right operators


For LeftOuter joins, it is a PartitioningCollection of the output partitioning of the left


operator

For RightOuter joins, it is a PartitioningCollection of the output partitioning of the


right operator

outputPartitioning throws an IllegalArgumentException for other join types:

[className] should not take [joinType] as the JoinType

Event-Time Watermark —  eventTimeWatermark Internal


Property

eventTimeWatermark: Option[Long]

When created, StreamingSymmetricHashJoinExec can be given the event-time watermark of


the current streaming micro-batch.

eventTimeWatermark is an optional property that is specified only after IncrementalExecution

was requested to apply the state preparation rule to a physical query plan of a streaming
query (to optimize (prepare) the physical plan of the streaming query once for
ContinuousExecution and every trigger for MicroBatchExecution in their queryPlanning
phases).

Note eventTimeWatermark is used when:
StreamingSymmetricHashJoinExec is requested to check out whether the last batch execution requires another non-data batch or not
OneSideHashJoiner is requested to storeAndJoinWithOtherSide

Watermark Predicates for State Removal 


—  stateWatermarkPredicates Internal Property

stateWatermarkPredicates: JoinStateWatermarkPredicates

When created, StreamingSymmetricHashJoinExec is given a JoinStateWatermarkPredicates


for the left and right join sides (using the StreamingSymmetricHashJoinHelper utility).

stateWatermarkPredicates contains the left and right predicates only when

IncrementalExecution is requested to apply the state preparation rule to a physical query


plan of a streaming query (to optimize (prepare) the physical plan of the streaming query


once for ContinuousExecution and every trigger for MicroBatchExecution in their


queryPlanning phases).

Note stateWatermarkPredicates is used when StreamingSymmetricHashJoinExec is requested for the following:
Processing partitions of the left and right sides of the stream-stream join (and creating OneSideHashJoiners)
Checking out whether the last batch execution requires another non-data batch or not

Required Partition Requirements 


—  requiredChildDistribution Method

requiredChildDistribution: Seq[Distribution]

requiredChildDistribution is part of the SparkPlan Contract for the required


partition requirements (aka required child distribution) of the input data, i.e. how
the output of the children physical operators is split across partitions before this
Note operator can be executed.
Read up on SparkPlan Contract in The Internals of Spark SQL online book.

requiredChildDistribution returns two HashClusteredDistributions for the left and right

keys with the required number of partitions based on the StatefulOperatorStateInfo.

requiredChildDistribution is used exclusively when EnsureRequirements


physical query plan optimization is executed (and enforces partition
requirements).
Note
Read up on EnsureRequirements Physical Query Optimization in The Internals
of Spark SQL online book.

HashClusteredDistribution becomes HashPartitioning at execution that


distributes rows across partitions (generates partition IDs of rows) based on
Murmur3Hash of the join expressions (separately for the left and right keys)
Note modulo the required number of partitions.

Read up on HashClusteredDistribution in The Internals of Spark SQL online


book.

Performance Metrics (SQLMetrics)


StreamingSymmetricHashJoinExec uses the performance metrics as other stateful physical

operators that write to a state store.

Figure 1. StreamingSymmetricHashJoinExec in web UI (Details for Query)


The following table shows how the performance metrics are computed (and so their exact
meaning).

Name (in web UI) Description

total time to update rows
Processing time of all rows

total time to remove rows

time to commit changes

number of output rows
Total number of output rows

number of total state rows

number of updated state rows
Number of updated state rows of the left and right OneSideHashJoiners

memory used by state

Checking Out Whether Last Batch Execution Requires


Another Non-Data Batch or Not 
—  shouldRunAnotherBatch Method

shouldRunAnotherBatch(
newMetadata: OffsetSeqMetadata): Boolean


Note shouldRunAnotherBatch is part of the StateStoreWriter contract to indicate whether MicroBatchExecution should run another non-data batch (based on the updated OffsetSeqMetadata with the current event-time watermark and the batch timestamp).

shouldRunAnotherBatch is positive ( true ) when all of the following hold:

Either the left or right join state watermark predicates are defined (in the
JoinStateWatermarkPredicates)

Event-time watermark threshold (of the StreamingSymmetricHashJoinExec operator) is


defined and the current event-time watermark threshold of the given OffsetSeqMetadata
is above (greater than) it, i.e. moved above

shouldRunAnotherBatch is negative ( false ) otherwise.
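
A simplified illustration of the condition (not the actual implementation; the two booleans stand for whether the left and right join state watermark predicates of the JoinStateWatermarkPredicates are defined, eventTimeWatermark is the operator's optional watermark, and newWatermarkMs stands for the event-time watermark of the given OffsetSeqMetadata ):

def shouldRunAnotherBatchSketch(
    leftPredicateDefined: Boolean,
    rightPredicateDefined: Boolean,
    eventTimeWatermark: Option[Long],
    newWatermarkMs: Long): Boolean = {
  val watermarkUsedForStateCleanup = leftPredicateDefined || rightPredicateDefined
  val watermarkHasMoved = eventTimeWatermark.exists(_ < newWatermarkMs)
  watermarkUsedForStateCleanup && watermarkHasMoved
}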

Executing Physical Operator (Generating


RDD[InternalRow]) —  doExecute Method

doExecute(): RDD[InternalRow]

doExecute is part of SparkPlan Contract to generate the runtime


Note representation of a physical operator as a recipe for distributed computation
over internal binary rows on Apache Spark ( RDD[InternalRow] ).

doExecute first requests the StreamingQueryManager for the StateStoreCoordinatorRef to the

StateStoreCoordinator RPC endpoint (for the driver).

doExecute then uses SymmetricHashJoinStateManager utility to get the names of the state

stores for the left and right sides of the streaming join.

In the end, doExecute requests the left and right child physical operators to execute
(generate an RDD) and then stateStoreAwareZipPartitions with processPartitions (and with
the StateStoreCoordinatorRef and the state stores).

Processing Partitions of Left and Right Sides of Stream-


Stream Join —  processPartitions Internal Method

processPartitions(
leftInputIter: Iterator[InternalRow],
rightInputIter: Iterator[InternalRow]): Iterator[InternalRow]


processPartitions records the current time (as updateStartTimeNs for the total time to

update rows performance metric in onOutputCompletion).

processPartitions creates a new predicate (postJoinFilter) based on the bothSides of the

JoinConditionSplitPredicates if defined or true literal.

processPartitions creates a OneSideHashJoiner for the LeftSide and all other properties

for the left-hand join side ( leftSideJoiner ).

processPartitions creates a OneSideHashJoiner for the RightSide and all other properties

for the right-hand join side ( rightSideJoiner ).

processPartitions requests the OneSideHashJoiner for the left-hand join side to

storeAndJoinWithOtherSide with the right-hand side one (that creates a leftOutputIter row
iterator) and the OneSideHashJoiner for the right-hand join side to do the same with the left-
hand side one (and creates a rightOutputIter row iterator).

processPartitions records the current time (as innerOutputCompletionTimeNs for the total

time to remove rows performance metric in onOutputCompletion).

processPartitions creates a CompletionIterator with the left and right output iterators

(with the rows of the leftOutputIter first followed by rightOutputIter ). When no rows are
left to process, the CompletionIterator records the completion time.

processPartitions creates a join-specific output Iterator[InternalRow] of the output rows

based on the join type (of the StreamingSymmetricHashJoinExec ):

For Inner joins, processPartitions simply uses the output iterator of the left and right
rows

For LeftOuter joins, processPartitions …​

For RightOuter joins, processPartitions …​

For other joins, processPartitions simply throws an IllegalArgumentException .

processPartitions creates an UnsafeProjection for the output (and the output of the left and right child operators) that counts all the rows of the join-specific output iterator (as the numOutputRows metric) and generates an output projection.

In the end, processPartitions returns a CompletionIterator with the output iterator with the rows counted (as the numOutputRows metric) and the onOutputCompletion completion function.

processPartitions is used exclusively when StreamingSymmetricHashJoinExec


Note
physical operator is requested to execute.


Calculating Performance Metrics (Output Completion


Callback) —  onOutputCompletion Internal Method

onOutputCompletion: Unit

onOutputCompletion calculates the total time to update rows performance metric (that is the

time since the processPartitions was executed).

onOutputCompletion adds the time for the inner join to complete (since

innerOutputCompletionTimeNs time marker) to the total time to remove rows performance


metric.

onOutputCompletion records the time to remove old state (per the join state watermark

predicate for the left and the right streaming queries) and adds it to the total time to remove
rows performance metric.

onOutputCompletion triggers the old state removal eagerly by iterating over the
Note
state rows to be deleted.

onOutputCompletion records the time for the left and right OneSideHashJoiners to commit

any state changes that becomes the time to commit changes performance metric.

onOutputCompletion calculates the number of updated state rows performance metric (as

the number of updated state rows of the left and right streaming queries).

onOutputCompletion calculates the number of total state rows performance metric (as the

sum of the number of keys in the KeyWithIndexToValueStore of the left and right streaming
queries).

onOutputCompletion calculates the memory used by state performance metric (as the sum

of the memory used by the KeyToNumValuesStore and KeyWithIndexToValueStore of the


left and right streams).

In the end, onOutputCompletion calculates the custom metrics.

Internal Properties


Name Description

hadoopConfBcast
Hadoop Configuration broadcast (to the Spark cluster)
Used exclusively to create a SymmetricHashJoinStateManager

joinStateManager
SymmetricHashJoinStateManager
Used when OneSideHashJoiner is requested to storeAndJoinWithOtherSide, removeOldState, commitStateAndGetMetrics, and for the values for a given key

nullLeft
GenericInternalRow of the size of the output schema of the left physical operator

nullRight
GenericInternalRow of the size of the output schema of the right physical operator

storeConf
StateStoreConf
Used exclusively to create a SymmetricHashJoinStateManager


FlatMapGroupsWithStateStrategy Execution
Planning Strategy for FlatMapGroupsWithState
Logical Operator
FlatMapGroupsWithStateStrategy is an execution planning strategy that can plan streaming

queries with FlatMapGroupsWithState unary logical operators to


FlatMapGroupsWithStateExec physical operator (with undefined
StatefulOperatorStateInfo , batchTimestampMs , and eventTimeWatermark ).

Tip Read up on Execution Planning Strategies in The Internals of Spark SQL book.

FlatMapGroupsWithStateStrategy is used exclusively when IncrementalExecution is

requested to plan a streaming query.

Demo: Using FlatMapGroupsWithStateStrategy


import org.apache.spark.sql.streaming.GroupState
val stateFunc = (key: Long, values: Iterator[(Timestamp, Long)], state: GroupState[Long
]) => {
Iterator((key, values.size))
}
import java.sql.Timestamp
import org.apache.spark.sql.streaming.{GroupStateTimeout, OutputMode}
val numGroups = spark.
readStream.
format("rate").
load.
as[(Timestamp, Long)].
groupByKey { case (time, value) => value % 2 }.
flatMapGroupsWithState(OutputMode.Update, GroupStateTimeout.NoTimeout)(stateFunc)

scala> numGroups.explain(true)
== Parsed Logical Plan ==
'SerializeFromObject [assertnotnull(assertnotnull(input[0, scala.Tuple2, true]))._1 AS
_1#267L, assertnotnull(assertnotnull(input[0, scala.Tuple2, true]))._2 AS _2#268]
+- 'FlatMapGroupsWithState <function3>, unresolveddeserializer(upcast(getcolumnbyordin
al(0, LongType), LongType, - root class: "scala.Long"), value#262L), unresolveddeseria
lizer(newInstance(class scala.Tuple2), timestamp#253, value#254L), [value#262L], [time
stamp#253, value#254L], obj#266: scala.Tuple2, class[value[0]: bigint], Update, false,
NoTimeout
+- AppendColumns <function1>, class scala.Tuple2, [StructField(_1,TimestampType,tru
e), StructField(_2,LongType,false)], newInstance(class scala.Tuple2), [input[0, bigint
, false] AS value#262L]
+- StreamingRelation DataSource(org.apache.spark.sql.SparkSession@38bcac50,rate,
List(),None,List(),None,Map(),None), rate, [timestamp#253, value#254L]

...

== Physical Plan ==
*SerializeFromObject [assertnotnull(input[0, scala.Tuple2, true])._1 AS _1#267L, asser
tnotnull(input[0, scala.Tuple2, true])._2 AS _2#268]
+- FlatMapGroupsWithState <function3>, value#262: bigint, newInstance(class scala.Tupl
e2), [value#262L], [timestamp#253, value#254L], obj#266: scala.Tuple2, StatefulOperato
rStateInfo(<unknown>,84b5dccb-3fa6-4343-a99c-6fa5490c9b33,0,0), class[value[0]: bigint
], Update, NoTimeout, 0, 0
+- *Sort [value#262L ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(value#262L, 200)
+- AppendColumns <function1>, newInstance(class scala.Tuple2), [input[0, bigi
nt, false] AS value#262L]
+- StreamingRelation rate, [timestamp#253, value#254L]


StatefulAggregationStrategy Execution
Planning Strategy — EventTimeWatermark and
Aggregate Logical Operators
StatefulAggregationStrategy is an execution planning strategy that is used to plan

streaming queries with the two logical operators:

EventTimeWatermark logical operator (for Dataset.withWatermark operator)

Aggregate logical operator (for Dataset.groupBy and Dataset.groupByKey operators,

and GROUP BY SQL clause)

Tip Read up on Execution Planning Strategies in The Internals of Spark SQL book.

StatefulAggregationStrategy is used exclusively when IncrementalExecution is requested

to plan a streaming query.

StatefulAggregationStrategy is available using SessionState .

spark.sessionState.planner.StatefulAggregationStrategy

Table 1. StatefulAggregationStrategy’s Logical to Physical Operator Conversions

Logical Operator Physical Operator

EventTimeWatermark
EventTimeWatermarkExec

Aggregate
In the order of preference:
1. HashAggregateExec
2. ObjectHashAggregateExec
3. SortAggregateExec

Tip Read up on Aggregation Execution Planning Strategy for Aggregate Physical Operators in The Internals of Spark SQL book.


val counts = spark.


readStream.
format("rate").
load.
groupBy(window($"timestamp", "5 seconds") as "group").
agg(count("value") as "count").
orderBy("group")
scala> counts.explain
== Physical Plan ==
*Sort [group#6 ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(group#6 ASC NULLS FIRST, 200)
+- *HashAggregate(keys=[window#13], functions=[count(value#1L)])
+- StateStoreSave [window#13], StatefulOperatorStateInfo(<unknown>,736d67c2-6daa
-4c4c-9c4b-c12b15af20f4,0,0), Append, 0
+- *HashAggregate(keys=[window#13], functions=[merge_count(value#1L)])
+- StateStoreRestore [window#13], StatefulOperatorStateInfo(<unknown>,736d
67c2-6daa-4c4c-9c4b-c12b15af20f4,0,0)
+- *HashAggregate(keys=[window#13], functions=[merge_count(value#1L)])
+- Exchange hashpartitioning(window#13, 200)
+- *HashAggregate(keys=[window#13], functions=[partial_count(valu
e#1L)])
+- *Project [named_struct(start, precisetimestampconversion(((
((CASE WHEN (cast(CEIL((cast((precisetimestampconversion(timestamp#0, TimestampType, L
ongType) - 0) as double) / 5000000.0)) as double) = (cast((precisetimestampconversion(
timestamp#0, TimestampType, LongType) - 0) as double) / 5000000.0)) THEN (CEIL((cast((
precisetimestampconversion(timestamp#0, TimestampType, LongType) - 0) as double) / 500
0000.0)) + 1) ELSE CEIL((cast((precisetimestampconversion(timestamp#0, TimestampType,
LongType) - 0) as double) / 5000000.0)) END + 0) - 1) * 5000000) + 0), LongType, Times
tampType), end, precisetimestampconversion(((((CASE WHEN (cast(CEIL((cast((precisetime
stampconversion(timestamp#0, TimestampType, LongType) - 0) as double) / 5000000.0)) as
double) = (cast((precisetimestampconversion(timestamp#0, TimestampType, LongType) - 0
) as double) / 5000000.0)) THEN (CEIL((cast((precisetimestampconversion(timestamp#0, T
imestampType, LongType) - 0) as double) / 5000000.0)) + 1) ELSE CEIL((cast((precisetim
estampconversion(timestamp#0, TimestampType, LongType) - 0) as double) / 5000000.0)) E
ND + 0) - 1) * 5000000) + 5000000), LongType, TimestampType)) AS window#13, value#1L]
+- *Filter isnotnull(timestamp#0)
+- StreamingRelation rate, [timestamp#0, value#1L]

import org.apache.spark.sql.streaming.{OutputMode, Trigger}


import scala.concurrent.duration._
val consoleOutput = counts.
writeStream.
format("console").
option("truncate", false).
trigger(Trigger.ProcessingTime(10.seconds)).
queryName("counts").
outputMode(OutputMode.Complete). // <-- required for groupBy
start

// Eventually...
consoleOutput.stop


Selecting Aggregate Physical Operator Given Aggregate


Expressions —  AggUtils.planStreamingAggregation
Internal Method

planStreamingAggregation(
groupingExpressions: Seq[NamedExpression],
functionsWithoutDistinct: Seq[AggregateExpression],
resultExpressions: Seq[NamedExpression],
child: SparkPlan): Seq[SparkPlan]

planStreamingAggregation takes the grouping attributes (from groupingExpressions ).

Note groupingExpressions corresponds to the grouping function in groupBy operator.

planStreamingAggregation creates an aggregate physical operator (called

partialAggregate ) with:

requiredChildDistributionExpressions undefined (i.e. None )

initialInputBufferOffset as 0

functionsWithoutDistinct in Partial mode

child operator as the input child

Note planStreamingAggregation creates one of the following aggregate physical operators (in the order of preference):
1. HashAggregateExec
2. ObjectHashAggregateExec
3. SortAggregateExec
planStreamingAggregation uses the AggUtils.createAggregate method to select an aggregate physical operator that you can read about in Selecting Aggregate Physical Operator Given Aggregate Expressions —  AggUtils.createAggregate Internal Method in Mastering Apache Spark 2 gitbook.

planStreamingAggregation creates an aggregate physical operator (called partialMerged1 )

with:

requiredChildDistributionExpressions based on the input groupingExpressions

initialInputBufferOffset as the length of groupingExpressions

functionsWithoutDistinct in PartialMerge mode

child operator as partialAggregate aggregate physical operator created above


planStreamingAggregation creates StateStoreRestoreExec with the grouping attributes,

undefined StatefulOperatorStateInfo , and partialMerged1 aggregate physical operator


created above.

planStreamingAggregation creates an aggregate physical operator (called partialMerged2 )

with:

child operator as StateStoreRestoreExec physical operator created above

The only difference between partialMerged1 and partialMerged2 steps is the


Note
child physical operator.

planStreamingAggregation creates StateStoreSaveExec with:

the grouping attributes based on the input groupingExpressions

No stateInfo , outputMode and eventTimeWatermark

child operator as partialMerged2 aggregate physical operator created above

In the end, planStreamingAggregation creates the final aggregate physical operator (called
finalAndCompleteAggregate ) with:

requiredChildDistributionExpressions based on the input groupingExpressions

initialInputBufferOffset as the length of groupingExpressions

functionsWithoutDistinct in Final mode

child operator as StateStoreSaveExec physical operator created above

planStreamingAggregation is used exclusively when


Note
StatefulAggregationStrategy plans a streaming aggregation.
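
Putting the steps above together, the operator chain built by planStreamingAggregation looks as follows (bottom-up; HashAggregate stands for whichever aggregate physical operator gets selected per the note above, and the exchanges for the required distribution are inserted later by a separate physical preparation rule):

HashAggregate (Final)                    <-- finalAndCompleteAggregate
+- StateStoreSave
   +- HashAggregate (PartialMerge)       <-- partialMerged2
      +- StateStoreRestore
         +- HashAggregate (PartialMerge) <-- partialMerged1
            +- HashAggregate (Partial)   <-- partialAggregate
               +- child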


StreamingDeduplicationStrategy Execution
Planning Strategy for Deduplicate Logical
Operator
StreamingDeduplicationStrategy is an execution planning strategy that can plan streaming

queries with Deduplicate logical operators (over streaming queries) to


StreamingDeduplicateExec physical operators.

Tip Read up on Execution Planning Strategies in The Internals of Spark SQL book.

Deduplicate logical operator represents Dataset.dropDuplicates operator in a


Note
logical query plan.

StreamingDeduplicationStrategy is available using SessionState .

spark.sessionState.planner.StreamingDeduplicationStrategy

Demo: Using StreamingDeduplicationStrategy

FIXME
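
A minimal sketch (mirroring the manual-planning pattern of the StreamingRelationStrategy demo later in this document): dropDuplicates on a streaming Dataset gives a Deduplicate logical operator that the strategy plans to StreamingDeduplicateExec .

val uniqueValues = spark.readStream
  .format("rate")
  .load
  .dropDuplicates("value") // <-- creates Deduplicate logical operator

// Let's do the planning manually
import spark.sessionState.planner.StreamingDeduplicationStrategy
val physicalPlan = StreamingDeduplicationStrategy.apply(uniqueValues.queryExecution.logical).head
println(physicalPlan.numberedTreeString)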


StreamingGlobalLimitStrategy Execution
Planning Strategy
StreamingGlobalLimitStrategy is an execution planning strategy that can plan streaming

queries with ReturnAnswer and Limit logical operators (over streaming queries) with the
Append output mode to StreamingGlobalLimitExec physical operator.

Tip Read up on Execution Planning Strategies in The Internals of Spark SQL book.

StreamingGlobalLimitStrategy is used (and created) exclusively when IncrementalExecution

is requested to plan a streaming query.

StreamingGlobalLimitStrategy takes a single OutputMode to be created (which is the

OutputMode of the IncrementalExecution).

Demo: Using StreamingGlobalLimitStrategy

FIXME


StreamingJoinStrategy Execution Planning


Strategy — Stream-Stream Equi-Joins
StreamingJoinStrategy is an execution planning strategy that can plan streaming queries

with Join logical operators of two streaming queries to a


StreamingSymmetricHashJoinExec physical operator.

Read up on Execution Planning Strategies in The Internals of Spark SQL online


Tip
book.

StreamingJoinStrategy throws an AnalysisException when applied to a Join logical

operator with no equality predicate:

Stream-stream join without equality predicate is not supported

StreamingJoinStrategy is used exclusively when IncrementalExecution is requested to plan

a streaming query.

StreamingJoinStrategy does not print out any messages to the logs.


StreamingJoinStrategy however uses ExtractEquiJoinKeys Scala extractor for
destructuring Join logical operators that does print out DEBUG messages to the
logs.
Read up on ExtractEquiJoinKeys in The Internals of Spark SQL online book.
Enable ALL logging level for
org.apache.spark.sql.catalyst.planning.ExtractEquiJoinKeys to see what
Tip
happens inside.

Add the following line to conf/log4j.properties :

log4j.logger.org.apache.spark.sql.catalyst.planning.ExtractEquiJoinKeys=ALL

Refer to Logging.


StreamingRelationStrategy Execution Planning


Strategy for StreamingRelation and
StreamingExecutionRelation Logical Operators
StreamingRelationStrategy is an execution planning strategy that can plan streaming

queries with StreamingRelation, StreamingExecutionRelation, and StreamingRelationV2


logical operators to StreamingRelationExec physical operators.

Figure 1. StreamingRelationStrategy, StreamingRelation, StreamingExecutionRelation and


StreamingRelationExec Operators
Tip Read up on Execution Planning Strategies in The Internals of Spark SQL book.

StreamingRelationStrategy is used exclusively when IncrementalExecution is requested to

plan a streaming query.

StreamingRelationStrategy is available using SessionState (of a SparkSession ).

spark.sessionState.planner.StreamingRelationStrategy

Demo: Using StreamingRelationStrategy


val rates = spark.


readStream.
format("rate").
load // <-- gives a streaming Dataset with a logical plan with StreamingRelation log
ical operator

// StreamingRelation logical operator for the rate streaming source


scala> println(rates.queryExecution.logical.numberedTreeString)
00 StreamingRelation DataSource(org.apache.spark.sql.SparkSession@31ba0af0,rate,List(),
None,List(),None,Map(),None), rate, [timestamp#0, value#1L]

// StreamingRelationExec physical operator (shown without "Exec" suffix)


scala> rates.explain
== Physical Plan ==
StreamingRelation rate, [timestamp#0, value#1L]

// Let's do the planning manually


import spark.sessionState.planner.StreamingRelationStrategy
val physicalPlan = StreamingRelationStrategy.apply(rates.queryExecution.logical).head
scala> println(physicalPlan.numberedTreeString)
00 StreamingRelation rate, [timestamp#0, value#1L]


UnsupportedOperationChecker
UnsupportedOperationChecker checks whether the logical plan of a streaming query uses

supported operations only.

Note UnsupportedOperationChecker is used exclusively when the internal spark.sql.streaming.unsupportedOperationCheck Spark property is enabled (which it is by default).

Note UnsupportedOperationChecker actually comes with two methods, i.e. checkForBatch and checkForStreaming, whose names reveal the two flavours of Spark SQL (as of 2.0), i.e. batch and streaming, respectively. The Spark Structured Streaming gitbook is solely focused on the checkForStreaming method.

checkForStreaming Method

checkForStreaming(
plan: LogicalPlan,
outputMode: OutputMode): Unit

checkForStreaming asserts that the following requirements hold:

1. Only one streaming aggregation is allowed

2. Streaming aggregation with Append output mode requires watermark (on the grouping
expressions)

3. Multiple flatMapGroupsWithState operators are only allowed with Append output mode

checkForStreaming …​FIXME

checkForStreaming finds all streaming aggregates (i.e. Aggregate logical operators with

streaming sources).

Aggregate logical operator represents Dataset.groupBy and


Note Dataset.groupByKey operators (and SQL’s GROUP BY clause) in a logical query
plan.

checkForStreaming asserts that there is exactly one streaming aggregation in a streaming

query.

Otherwise, checkForStreaming reports an AnalysisException :


Multiple streaming aggregations are not supported with streaming


DataFrames/Datasets
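
A minimal illustration (source and sink are arbitrary): chaining two streaming aggregations makes starting the query fail with the AnalysisException above.

val rates = spark.readStream.format("rate").load

val twoAggregations = rates
  .groupBy("value").count
  .withColumnRenamed("count", "cnt")
  .groupBy("cnt").count

// Fails with "Multiple streaming aggregations are not supported..."
twoAggregations.writeStream
  .format("console")
  .outputMode("complete")
  .start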

checkForStreaming asserts that watermark was defined for a streaming aggregation with

Append output mode (on at least one of the grouping expressions).

Otherwise, checkForStreaming reports an AnalysisException :

Append output mode not supported when there are streaming


aggregations on streaming DataFrames/DataSets without watermark
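
A minimal illustration: an aggregation over a non-time column started in Append output mode without a watermark triggers the AnalysisException above (the Append-mode sketch earlier, with withWatermark and a window on the event-time column among the grouping expressions, passes this check).

val counts = spark.readStream.format("rate").load
  .groupBy("value")
  .count

// Fails with "Append output mode not supported..." since no watermark is defined
counts.writeStream
  .format("console")
  .outputMode("append")
  .start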

Caution FIXME

checkForStreaming counts all FlatMapGroupsWithState logical operators (on streaming

Datasets with isMapGroupsWithState flag disabled).

FlatMapGroupsWithState logical operator represents


KeyValueGroupedDataset.mapGroupsWithState and
Note
KeyValueGroupedDataset.flatMapGroupsWithState operators in a logical query
plan.

FlatMapGroupsWithState.isMapGroupsWithState flag is disabled when…​


Note
FIXME

checkForStreaming asserts that multiple FlatMapGroupsWithState logical operators are only

used when:

outputMode is Append output mode

outputMode of the FlatMapGroupsWithState logical operators is also Append output


mode

Caution FIXME Reference to an example in flatMapGroupsWithState

Otherwise, checkForStreaming reports an AnalysisException :

Multiple flatMapGroupsWithStates are not supported when they are


not all in append mode or the output mode is not append on a
streaming DataFrames/Datasets

Caution FIXME


checkForStreaming is used exclusively when StreamingQueryManager is


requested to create a StreamingQueryWrapper (for starting a streaming query),
Note
but only when the internal spark.sql.streaming.unsupportedOperationCheck
Spark property is enabled (which is by default).
