Spark Structured Streaming
Table of Contents
Introduction 1.1
Spark Structured Streaming and Streaming Queries 1.2
Batch Processing Time 1.2.1
Internals of Streaming Queries 1.3
Streaming Join
Streaming Join 2.1
StateStoreAwareZipPartitionsRDD 2.2
SymmetricHashJoinStateManager 2.3
StateStoreHandler 2.3.1
KeyToNumValuesStore 2.3.2
KeyWithIndexToValueStore 2.3.3
OneSideHashJoiner 2.4
JoinStateWatermarkPredicates 2.5
JoinStateWatermarkPredicate 2.5.1
StateStoreAwareZipPartitionsHelper 2.6
StreamingSymmetricHashJoinHelper 2.7
StreamingJoinHelper 2.8
Demos
Demos 4.1
Internals of FlatMapGroupsWithStateExec Physical Operator 4.2
Arbitrary Stateful Streaming Aggregation with
KeyValueGroupedDataset.flatMapGroupsWithState Operator 4.3
Exploring Checkpointed State 4.4
Streaming Watermark with Aggregation in Append Output Mode 4.5
Streaming Query for Running Counts (Socket Source and Complete Output Mode) 4.6
Streaming Aggregation with Kafka Data Source 4.7
groupByKey Streaming Aggregation in Update Mode 4.8
StateStoreSaveExec with Complete Output Mode 4.9
StateStoreSaveExec with Update Output Mode 4.10
Developing Custom Streaming Sink (and Monitoring SQL Queries in web UI) 4.11
current_timestamp Function For Processing Time in Streaming Queries 4.12
Using StreamingQueryManager for Query Termination Management 4.13
Streaming Aggregation
Streaming Aggregation 5.1
StateStoreRDD 5.2
StateStoreOps 5.2.1
StreamingAggregationStateManager 5.3
StreamingAggregationStateManagerBaseImpl 5.3.1
StreamingAggregationStateManagerImplV1 5.3.2
StreamingAggregationStateManagerImplV2 5.3.3
StateStoreId 6.5.1
HDFSBackedStateStore 6.5.2
StateStoreProvider 6.6
StateStoreProviderId 6.6.1
HDFSBackedStateStoreProvider 6.6.2
StateStoreCoordinator 6.7
StateStoreCoordinatorRef 6.7.1
WatermarkSupport 6.8
StatefulOperator 6.9
StateStoreReader 6.9.1
StateStoreWriter 6.9.2
StatefulOperatorStateInfo 6.10
StateStoreMetrics 6.11
StateStoreCustomMetric 6.12
StateStoreUpdater 6.13
EventTimeStatsAccum 6.14
StateStoreConf 6.15
DataStreamWriter 8.2
OutputMode 8.2.1
Trigger 8.2.2
StreamingQuery 8.3
Streaming Operators 8.4
dropDuplicates Operator 8.4.1
explain Operator 8.4.2
groupBy Operator 8.4.3
groupByKey Operator 8.4.4
withWatermark Operator 8.4.5
window Function 8.5
KeyValueGroupedDataset 8.6
mapGroupsWithState Operator 8.6.1
flatMapGroupsWithState Operator 8.6.2
StreamingQueryManager 8.7
SQLConf 8.8
Configuration Properties 8.9
FileStreamSink 10.2
FileStreamSinkLog 10.3
SinkFileStatus 10.4
ManifestFileCommitProtocol 10.5
MetadataLogFileIndex 10.6
Rate Data Source
RateSourceProvider 13.1
RateStreamSource 13.2
RateStreamMicroBatchReader 13.3
Offsets and Metadata Checkpointing (Fault-
Tolerance and Reliability)
Offsets and Metadata Checkpointing 18.1
MetadataLog 18.2
HDFSMetadataLog 18.3
CommitLog 18.4
CommitMetadata 18.4.1
OffsetSeqLog 18.5
OffsetSeq 18.5.1
CompactibleFileStreamLog 18.6
FileStreamSourceLog 18.6.1
OffsetSeqMetadata 18.7
CheckpointFileManager 18.8
FileContextBasedCheckpointFileManager 18.8.1
FileSystemBasedCheckpointFileManager 18.8.2
Offset 18.9
StreamProgress 18.10
Continuous Stream Processing (Structured
Streaming V2)
Continuous Stream Processing 20.1
ContinuousExecution 20.2
ContinuousReadSupport Contract 20.3
ContinuousReader Contract 20.4
RateStreamContinuousReader 20.5
EpochCoordinator RPC Endpoint 20.6
EpochCoordinatorRef 20.6.1
EpochTracker 20.6.2
ContinuousQueuedDataReader 20.7
DataReaderThread 20.7.1
EpochMarkerGenerator 20.7.2
PartitionOffset 20.8
ContinuousExecutionRelation Leaf Logical Operator 20.9
WriteToContinuousDataSource Unary Logical Operator 20.10
WriteToContinuousDataSourceExec Unary Physical Operator 20.11
ContinuousWriteRDD 20.11.1
ContinuousDataSourceRDD 20.12
Logical Operators
EventTimeWatermark Unary Logical Operator 22.1
FlatMapGroupsWithState Unary Logical Operator 22.2
Deduplicate Unary Logical Operator 22.3
MemoryPlan Logical Query Plan 22.4
StreamingRelation Leaf Logical Operator for Streaming Source 22.5
StreamingRelationV2 Leaf Logical Operator 22.6
StreamingExecutionRelation Leaf Logical Operator for Streaming Source At Execution
22.7
Physical Operators
EventTimeWatermarkExec 23.1
FlatMapGroupsWithStateExec 23.2
StateStoreRestoreExec 23.3
StateStoreSaveExec 23.4
StreamingDeduplicateExec 23.5
StreamingGlobalLimitExec 23.6
StreamingRelationExec 23.7
StreamingSymmetricHashJoinExec 23.8
Varia
UnsupportedOperationChecker 25.1
Introduction
I’m Jacek Laskowski, a freelance IT consultant, software engineer and technical instructor
specializing in Apache Spark, Apache Kafka and Kafka Streams (with Scala and sbt).
I offer software development and consultancy services with hands-on in-depth workshops
and mentoring. Reach out to me at [email protected] or @jaceklaskowski to discuss
opportunities.
Consider joining me at Warsaw Scala Enthusiasts and Warsaw Spark meetups in Warsaw,
Poland.
Tip: I'm also writing other books in the "The Internals of" series about Apache Spark, Spark SQL, Apache Kafka, and Kafka Streams.
Expect text and code snippets from a variety of public sources. Attribution follows.
Now, let me introduce you to Spark Structured Streaming and Streaming Queries.
Spark Structured Streaming and Streaming Queries
Streaming queries can be expressed using a high-level declarative streaming API (Dataset
API) or good ol' SQL (SQL over stream / streaming SQL). The declarative streaming Dataset
API and SQL are executed on the underlying highly-optimized Spark SQL engine.
The semantics of the Structured Streaming model is as follows (see the article Structured
Streaming In Apache Spark):
At any time, the output of a continuous application is equivalent to executing a batch job
on a prefix of the data.
Note: As of Spark 2.2.0, Structured Streaming has been marked stable and ready for production use. With that, the older streaming module, Spark Streaming, is considered obsolete and should not be used for developing new streaming applications with Apache Spark.
Spark Structured Streaming comes with two stream execution engines for executing streaming queries:
Micro-Batch Stream Processing (MicroBatchExecution)
Continuous Stream Processing (ContinuousExecution)
The goal of Spark Structured Streaming is to unify streaming, interactive, and batch queries
over structured datasets for developing end-to-end stream processing applications dubbed
continuous applications using Spark SQL’s Datasets API with additional support for the
following features:
Streaming Aggregation
Streaming Join
Streaming Watermark
Structured Streaming introduces the concept of streaming datasets that are infinite
datasets with primitives like input streaming data sources and output streaming data sinks.
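For example, a batch Dataset and a streaming Dataset could be created as follows (a minimal sketch, assuming spark-shell with a SparkSession available as spark; the rate data source and the names are illustrative):
// batch Dataset backed by an in-memory range
val batchQuery = spark.range(5)
// streaming Dataset backed by the rate data source
val streamingQuery = spark.readStream.format("rate").load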
assert(batchQuery.isStreaming == false)
assert(streamingQuery.isStreaming)
Read up on Spark SQL, Datasets and logical plans in The Internals of Spark SQL
book.
Structured Streaming models a stream of data as an infinite (and hence continuous) table
that could be changed every streaming batch.
You can specify the output mode of a streaming dataset, which determines what (of the infinite result table) gets written out to a streaming sink when there is new data available.
Streaming Datasets use streaming query plans (as opposed to regular batch Datasets that
are based on batch query plans).
assert(streamingQuery.isStreaming)
With Structured Streaming, Spark 2 aims at simplifying streaming analytics so that there is little to no need to reason about the underlying stream processing machinery (hiding the unnecessary complexity in your streaming analytics architectures).
StreamingQuery
Streaming Source
Streaming Sink
StreamingQueryManager
Structured Streaming follows the micro-batch model and periodically fetches data from the data source (using the DataFrame data abstraction to represent the fetched data for a certain batch).
With Datasets as Spark SQL’s view of structured data, structured streaming checks input
sources for new data every trigger (time) and executes the (continuous) queries.
Note: The feature has also been called Streaming Spark SQL Query, Streaming DataFrames, Continuous DataFrame or Continuous Query. There were lots of names before the Spark project settled on Structured Streaming.
(video) The Future of Real Time in Spark from Spark Summit East 2016 in which
Reynold Xin presents the concept of Streaming DataFrames
(video) A Deep Dive Into Structured Streaming by Tathagata "TD" Das from Spark
Summit 2016
Batch Processing Time
The following standard functions (and their underlying Catalyst expressions) allow accessing the batch processing time in Micro-Batch Stream Processing:
current_timestamp (CurrentTimestamp)
current_date (CurrentDate)
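For example, a streaming query could record the batch processing time as an extra column (a minimal sketch; the rate source and the column name are illustrative):
import org.apache.spark.sql.functions.current_timestamp
val withBatchTime = spark.readStream
  .format("rate")
  .load
  .withColumn("batch_processing_time", current_timestamp())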
Internals
GroupStateImpl is given the batch processing time when created for a streaming query (that
is actually the batch processing time of the FlatMapGroupsWithStateExec physical
operator).
Internals of Streaming Queries
5. StreamingQuery
6. StreamingQueryManager
import org.apache.spark.sql.SparkSession
assert(spark.isInstanceOf[SparkSession])
import org.apache.spark.sql.streaming.DataStreamReader
val reader = spark.readStream // SparkSession.readStream gives a DataStreamReader
assert(reader.isInstanceOf[DataStreamReader])
The fluent API of DataStreamReader allows you to describe the input data source (e.g. DataStreamReader.format and DataStreamReader.options) using method chaining (with the goal of making the readability of the source code close to that of ordinary written prose, essentially creating a domain-specific language within the interface; see the Fluent interface article in Wikipedia).
reader
.format("csv")
.option("delimiter", "|")
There are a couple of built-in data source formats. Their names are the names of the corresponding DataStreamReader methods and so act like shortcuts of DataStreamReader.format (where you have to specify the format by name), i.e. csv, json, orc, parquet and text.
You may also want to use DataStreamReader.schema method to specify the schema of the
streaming data source.
In the end, you use DataStreamReader.load method that simply creates a streaming Dataset
(the good ol' Dataset that you may have already used in Spark SQL).
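For example (a minimal sketch; the directory path and schema are illustrative, and file-based streaming sources require a user-specified schema):
import org.apache.spark.sql.types.{LongType, StringType, StructType}
val schema = new StructType().add("id", LongType).add("name", StringType)
val input = spark.readStream
  .format("csv")
  .option("delimiter", "|")
  .schema(schema)
  .load("/tmp/csv-input")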
import org.apache.spark.sql.DataFrame
assert(input.isInstanceOf[DataFrame])
The Dataset has the isStreaming property enabled, which is basically the only way you could distinguish streaming Datasets from regular, batch Datasets.
assert(input.isStreaming)
In other words, Spark Structured Streaming is designed to extend the features of Spark SQL
and let your structured queries be streaming queries.
Whenever you create a Dataset (be it batch in Spark SQL or streaming in Spark Structured Streaming), you create a logical query plan using the high-level Dataset DSL.
Spark Structured Streaming gives you two logical operators to represent streaming sources,
i.e. StreamingRelationV2 and StreamingRelation.
When DataStreamReader.load method is executed, load first looks up the requested data
source (that you specified using DataStreamReader.format) and creates an instance of it
(instantiation). That’d be data source resolution step (that I described in…FIXME).
DataStreamReader.load is where you can find the intersection of the former Micro-Batch
Stream Processing V1 API with the new Continuous Stream Processing V2 API.
For all other types of streaming data sources, DataStreamReader.load creates a logical query
plan with a StreamingRelation leaf logical operator. That is the former V1 code path.
Please note that a streaming Dataset is a regular Dataset (with some streaming-related
limitations).
import org.apache.spark.sql.Dataset
assert(countByTime.isInstanceOf[Dataset[_]])
The point is to understand that the Dataset API is a domain-specific language (DSL) to build
a more sophisticated stream processing pipeline that you could also build using the low-level
logical operators directly.
Use Dataset.explain to learn the underlying logical and physical query plans.
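A sketch of a streaming query that could produce a physical plan like the one below (the rate source, the 10-second watermark and the per-timestamp count are read off the plan; the exact query is an assumption):
val countByTime = spark.readStream
  .format("rate")
  .load
  .withWatermark("timestamp", "10 seconds")
  .groupBy($"timestamp")
  .count
countByTime.explain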
assert(countByTime.isStreaming)
== Physical Plan ==
*(5) HashAggregate(keys=[timestamp#88-T10000ms], functions=[count(1)], output=[timestamp#88-T10000ms, count#131L])
+- StateStoreSave [timestamp#88-T10000ms], state info [ checkpoint = <unknown>, runId = 28606ba5-9c7f-4f1f-ae41-e28d75c4d948, opId = 0, ver = 0, numPartitions = 200], Append, 0, 2
   +- *(4) HashAggregate(keys=[timestamp#88-T10000ms], functions=[merge_count(1)], output=[timestamp#88-T10000ms, count#136L])
      +- StateStoreRestore [timestamp#88-T10000ms], state info [ checkpoint = <unknown>, runId = 28606ba5-9c7f-4f1f-ae41-e28d75c4d948, opId = 0, ver = 0, numPartitions = 200], 2
         +- *(3) HashAggregate(keys=[timestamp#88-T10000ms], functions=[merge_count(1)], output=[timestamp#88-T10000ms, count#136L])
            +- Exchange hashpartitioning(timestamp#88-T10000ms, 200)
               +- *(2) HashAggregate(keys=[timestamp#88-T10000ms], functions=[partial_count(1)], output=[timestamp#88-T10000ms, count#136L])
                  +- EventTimeWatermark timestamp#88: timestamp, interval 10 seconds
                     +- *(1) Project [timestamp#88]
                        +- StreamingRelation rate, [timestamp#88, value#89L]
Please note that most of the stream processing operators are the very operators you may have already used in batch structured queries in Spark SQL. Again, the distinction between Spark SQL and Spark Structured Streaming is very thin from a developer's point of view.
import org.apache.spark.sql.streaming.DataStreamWriter
val writer = countByTime.writeStream // Dataset.writeStream gives a DataStreamWriter
assert(writer.isInstanceOf[DataStreamWriter[_]])
The fluent API of DataStreamWriter allows you to describe the output data sink (e.g. DataStreamWriter.format and DataStreamWriter.options) using method chaining (with the goal of making the readability of the source code close to that of ordinary written prose, essentially creating a domain-specific language within the interface; see the Fluent interface article in Wikipedia).
writer
.format("csv")
.option("delimiter", "\t")
Like in DataStreamReader data source formats, there are a couple of built-in data sink formats. Unlike data source formats, their names do not have corresponding DataStreamWriter methods. The reason is that you will use DataStreamWriter.start to create and immediately start a StreamingQuery.
There are however two special output formats that do have corresponding DataStreamWriter methods, i.e. DataStreamWriter.foreach and DataStreamWriter.foreachBatch.
DataStreamWriter API defines two new concepts (that are not available in the "base" Spark SQL):
OutputMode
Trigger
You may also want to give a streaming query a name using DataStreamWriter.queryName
method.
In the end, you use DataStreamWriter.start method to create and immediately start a
StreamingQuery.
import org.apache.spark.sql.streaming.OutputMode
import org.apache.spark.sql.streaming.Trigger
import scala.concurrent.duration._
val sq = writer
.format("console")
.option("truncate", false)
.option("checkpointLocation", "/tmp/csv-to-csv-checkpoint")
.outputMode(OutputMode.Append)
.trigger(Trigger.ProcessingTime(30.seconds))
.queryName("csv-to-csv")
.start("/tmp")
import org.apache.spark.sql.streaming.StreamingQuery
assert(sq.isInstanceOf[StreamingQuery])
When DataStreamWriter is requested to start a streaming query, it allows for the following
data source formats:
StreamWriteSupport
StreamingQuery
When a stream processing pipeline is started (using DataStreamWriter.start method), DataStreamWriter creates a StreamingQuery and requests the StreamingQueryManager to start a streaming query.
StreamingQueryManager
StreamingQueryManager is used to manage streaming queries.
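The StreamingQueryManager of a SparkSession is available as SparkSession.streams (a minimal sketch):
val manager = spark.streams
// All streaming queries that are currently active in the session
manager.active.foreach(q => println(s"${q.id}: ${q.name}"))
// Wait for any of the active streaming queries to terminate
// manager.awaitAnyTermination()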
Streaming Join
In Spark Structured Streaming, a streaming join is a streaming query that was described (built) using the following high-level streaming operators:
Dataset.crossJoin
Dataset.join
Dataset.joinWith
Joins of a streaming query and a batch query (stream-static joins) are stateless and no
state management is required
Joins of two streaming queries (stream-stream joins) are stateful and require streaming
state (with an optional join state watermark for state removal).
Stream-Stream Joins
Spark Structured Streaming supports stream-stream joins with the following:
Equality predicate (i.e. equi-joins that use only equality comparisons in the join
predicate)
A join state watermark can be specified on key state, value state or both.
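A minimal sketch of an equi-join of two streaming Datasets with watermarks on both sides (the rate sources, column names and watermark thresholds are illustrative):
val left = spark.readStream.format("rate").load
  .withColumnRenamed("value", "leftValue")
  .withWatermark("timestamp", "10 seconds")
val right = spark.readStream.format("rate").load
  .withColumnRenamed("value", "rightValue")
  .withWatermark("timestamp", "20 seconds")
// Equality predicate on the (watermarked) event-time column
val joined = left.join(right, Seq("timestamp"))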
IncrementalExecution — QueryExecution of Streaming
Queries
Under the covers, the high-level operators create a logical query plan with one or more
Join logical operators.
Tip Read up on Join Logical Operator in The Internals of Spark SQL online book.
Demos
Use the following demo application to learn more:
StreamStreamJoinApp
StateStoreAwareZipPartitionsRDD
StateStoreAwareZipPartitionsRDD is a ZippedPartitionsRDD2 with the left and right parent
RDDs.
SparkContext
StatefulOperatorStateInfo
StateStoreCoordinatorRef
When requested for the preferred locations of a partition, StateStoreAwareZipPartitionsRDD requests the StateStoreCoordinatorRef for the location of every state store (with the StatefulOperatorStateInfo and the partition ID) and returns the unique executor IDs (so that processing a partition happens on the executor with the proper state store for the operator and the partition).
SymmetricHashJoinStateManager
SymmetricHashJoinStateManager is created for the left and right OneSideHashJoiners of a StreamingSymmetricHashJoinExec physical operator (of a stream-stream join).
SymmetricHashJoinStateManager manages the KeyToNumValuesStore and the KeyWithIndexToValueStore state store handlers (and simply acts like their facade).
JoinSide
StatefulOperatorStateInfo
StateStoreConf
Hadoop Configuration
removeByKeyCondition
removeByValueCondition
Performance metrics
metrics: StateStoreMetrics
removeByKeyCondition Method
removeByKeyCondition(
removalCondition: UnsafeRow => Boolean): Iterator[UnsafeRowPair]
removeByKeyCondition uses the KeyToNumValuesStore for all state keys and values (in the internal allKeyToNumValues sequence).
getNext(): UnsafeRowPair
getNext goes over the keys and values in the allKeyToNumValues sequence and removes
keys (from the KeyToNumValuesStore) and the corresponding values (from the
KeyWithIndexToValueStore) for which the given removalCondition predicate holds.
removeByValueCondition Method
removeByValueCondition(
removalCondition: UnsafeRow => Boolean): Iterator[UnsafeRowPair]
removeByValueCondition removes values (and the associated keys if needed) for which the given removalCondition predicate holds.
getNext(): UnsafeRowPair
getNext …FIXME
append(
key: UnsafeRow,
value: UnsafeRow): Unit
append requests the KeyToNumValuesStore for the number of value rows for the given key, requests the KeyWithIndexToValueStore to store the given value row (at the next index), and in the end requests the KeyToNumValuesStore to store the given key with the number of value rows incremented.
get requests the KeyToNumValuesStore for the number of value rows for the given key.
In the end, get requests the KeyWithIndexToValueStore to retrieve that number of value
rows for the given key and leaves value rows only.
commit(): Unit
abortIfNeeded(): Unit
abortIfNeeded …FIXME
allStateStoreNames simply returns the names of the state stores for all possible
combinations of the given JoinSides and the two possible store types (e.g.
keyToNumValues and keyWithIndexToValue).
getStateStoreName(
joinSide: JoinSide,
storeType: StateStoreType): String
[joinSide]-[storeType]
updateNumValueForCurrentKey(): Unit
updateNumValueForCurrentKey …FIXME
Internal Properties
keyAttributes: Key attributes, i.e. AttributeReferences of the key schema. Used exclusively in KeyWithIndexToValueStore when requested for the keyWithIndexExprs, indexOrdinalInKeyWithIndexRow, keyWithIndexRowGenerator and keyRowGenerator.
keySchema: Used when SymmetricHashJoinStateManager is requested for the key attributes (for KeyWithIndexToValueStore).
StateStoreHandler
stateStore: StateStore
Table 2. StateStoreHandlers
KeyToNumValuesStore: StateStoreHandler of KeyToNumValuesType
KeyWithIndexToValueStore: StateStoreHandler of KeyWithIndexToValueType
Refer to Logging.
metrics: StateStoreMetrics
commit(): Unit
commit …FIXME
abortIfNeeded Method
abortIfNeeded(): Unit
abortIfNeeded …FIXME
getStateStore(
keySchema: StructType,
valueSchema: StructType): StateStore
getStateStore creates a StateStoreProviderId (with the StatefulOperatorStateInfo of the owning SymmetricHashJoinStateManager, the partition ID from the execution context, and the name of the state store for the JoinSide and StateStoreType) and requests the StateStore utility for the StateStore by that StateStoreProviderId.
In the end, getStateStore prints out the following INFO message to the logs:
Table 3. StateStoreTypes
KeyToNumValuesType (toString: keyToNumValues)
KeyWithIndexToValueType (toString: keyWithIndexToValue)
Note: StateStoreType is a Scala private sealed trait, which means that all the implementations are in the same compilation unit (a single file).
KeyToNumValuesStore
KeyToNumValuesStore uses a value schema (for the state store) with a single value field (of type long for the number of value rows per key).
Refer to Logging.
get requests the StateStore for the value row for the given key and returns the long value at the 0th position (or 0 when the key is not found).
put(
key: UnsafeRow,
numValues: Long): Unit
put stores the numValues at the 0th position (of the internal unsafe row) and requests the StateStore to store it for the given key.
put requires that the numValues count is greater than 0 (or throws an IllegalArgumentException).
iterator: Iterator[KeyAndNumValues]
iterator simply requests the StateStore for all state keys and values.
KeyWithIndexToValueStore — State Store (Handler) Of Join Keys With Index Of Values
KeyWithIndexToValueStore is a StateStoreHandler (of KeyWithIndexToValueType) for keys with an index of their values.
KeyWithIndexToValueStore uses a key schema (for the state store) that is the key schema (of the parent SymmetricHashJoinStateManager) with an extra index field of type long.
Refer to Logging.
get(
key: UnsafeRow,
valueIndex: Long): UnsafeRow
get simply requests the internal state store to look up the value for the given key and
valueIndex.
getAll(
key: UnsafeRow,
numValues: Long): Iterator[KeyWithIndexAndValue]
getAll …FIXME
put(
key: UnsafeRow,
valueIndex: Long,
value: UnsafeRow): Unit
put …FIXME
remove Method
remove(
key: UnsafeRow,
valueIndex: Long): Unit
remove …FIXME
keyWithIndexRow(
key: UnsafeRow,
valueIndex: Long): UnsafeRow
removeAllValues Method
removeAllValues(
key: UnsafeRow,
numValues: Long): Unit
removeAllValues …FIXME
iterator Method
iterator: Iterator[KeyWithIndexAndValue]
iterator …FIXME
Internal Properties
indexOrdinalInKeyWithIndexRow: Position of the index in the key row (which corresponds to the number of the key attributes). Used exclusively in the keyWithIndexRow.
OneSideHashJoiner
OneSideHashJoiner manages join state of one side of a stream-stream join (using
SymmetricHashJoinStateManager).
OneSideHashJoiner is created exclusively for the StreamingSymmetricHashJoinExec physical operator (when requested to process partitions of the left and right sides of a stream-stream join).
OneSideHashJoiner uses an optional join state watermark predicate to remove old state.
JoinSide
JoinStateWatermarkPredicate
SymmetricHashJoinStateManager — joinStateManager
Internal Property
joinStateManager: SymmetricHashJoinStateManager
OneSideHashJoiner (with the join side, the input attributes, the join keys, and the
storeAndJoinWithOtherSide
commitStateAndGetMetrics
numUpdatedStateRows: Long
stateWatermarkPredicate: Option[JoinStateWatermarkPredicate]
storeAndJoinWithOtherSide Method
storeAndJoinWithOtherSide(
otherSideJoiner: OneSideHashJoiner)(
generateJoinedRow: (InternalRow, InternalRow) => JoinedRow): Iterator[InternalRow]
storeAndJoinWithOtherSide tries to find the watermark attribute among the input attributes.
storeAndJoinWithOtherSide uses the stateKeyWatermarkPredicateFunc (on the extracted join key) and the stateValueWatermarkPredicateFunc (on the current input row) to determine whether to request the SymmetricHashJoinStateManager to append the key and the input row (to a join state). If so, storeAndJoinWithOtherSide increments the updatedStateRowsCount counter.
For LeftSide and LeftOuter , the join row is the current row with the values of the right
side all null ( nullRight )
For RightSide and RightOuter , the join row is the current row with the values of the left
side all null ( nullLeft )
For all other combinations, the iterator is simply empty (that will be removed from the
output by the outer nonLateRows.flatMap).
removeOldState(): Iterator[UnsafeRowPair]
removeOldState branches off per the optional JoinStateWatermarkPredicate:
For a JoinStateKeyWatermarkPredicate, removeOldState requests the SymmetricHashJoinStateManager to removeByKeyCondition
For a JoinStateValueWatermarkPredicate, removeOldState requests the SymmetricHashJoinStateManager to removeByValueCondition
For any other predicates, removeOldState returns an empty iterator (no rows to process)
get simply requests the SymmetricHashJoinStateManager to retrieve value rows for the
key.
commitStateAndGetMetrics(): StateStoreMetrics
Internal Properties
keyGenerator: UnsafeProjection
Function to project (extract) join keys from an input row. Used when…FIXME
JoinStateWatermarkPredicates — Watermark Predicates for State Removal
JoinStateWatermarkPredicates contains watermark predicates for state removal of the left and right sides of a stream-stream join (and is used for the state preparation rule to optimize and specify the execution-specific configuration for a query plan with StreamingSymmetricHashJoinExec physical operators).
toString: String
toString uses the left and right predicates for the string representation:
JoinStateWatermarkPredicate
JoinStateWatermarkPredicate is the abstraction of join state watermark predicates (of the left and right sides in JoinStateWatermarkPredicates) with the following properties:
desc: String
expr: Expression
Table 2. JoinStateWatermarkPredicates
JoinStateKeyWatermarkPredicate: watermark predicate on the state keys
JoinStateValueWatermarkPredicate: watermark predicate on the state values
toString: String
toString uses the desc and expr for the string representation:
[desc]: [expr]
StateStoreAwareZipPartitionsHelper — Extension Methods for Creating StateStoreAwareZipPartitionsRDD
StateStoreAwareZipPartitionsHelper is a Scala implicit class of a data RDD (of type RDD[T]) to create a StateStoreAwareZipPartitionsRDD.
Note: Implicit classes are a language feature in Scala for implicit conversions with extension methods for existing types.
Creating StateStoreAwareZipPartitionsRDD
— stateStoreAwareZipPartitions Method
StreamingSymmetricHashJoinHelper Utility
StreamingSymmetricHashJoinHelper is a Scala object with the following utility methods:
getStateWatermarkPredicates
Creating JoinStateWatermarkPredicates
— getStateWatermarkPredicates Object Method
getStateWatermarkPredicates(
leftAttributes: Seq[Attribute],
rightAttributes: Seq[Attribute],
leftKeys: Seq[Expression],
rightKeys: Seq[Expression],
condition: Option[Expression],
eventTimeWatermark: Option[Long]): JoinStateWatermarkPredicates
getStateWatermarkPredicates tries to find the index of the watermark attribute among the left join keys (or the right join keys).
getStateWatermarkPredicates determines the state watermark predicate for the left side of a
join (for the given leftAttributes , the leftKeys and the rightAttributes ).
getStateWatermarkPredicates determines the state watermark predicate for the right side of
a join (for the given rightAttributes , the rightKeys and the leftAttributes ).
getOneSideStateWatermarkPredicate(
oneSideInputAttributes: Seq[Attribute],
oneSideJoinKeys: Seq[Expression],
otherSideInputAttributes: Seq[Attribute]): Option[JoinStateWatermarkPredicate]
attribute (the oneSideInputAttributes attributes, the left or right join keys) and creates a
JoinStateWatermarkPredicate as follows:
StreamingJoinHelper Utility
StreamingJoinHelper is a Scala object with the following utility methods:
getStateValueWatermark
Refer to Logging.
getStateValueWatermark(
attributesToFindStateWatermarkFor: AttributeSet,
attributesWithEventWatermark: AttributeSet,
joinCondition: Option[Expression],
eventWatermark: Option[Long]): Option[Long]
getStateValueWatermark …FIXME
Extending Structured Streaming with New Data Sources
Spark SQL is migrating from the former Data Source API V1 to a new Data Source API V2, and so is Structured Streaming. That is exactly the reason for the BaseStreamingSource and BaseStreamingSink APIs for the two different Data Source APIs' class hierarchies, for streaming sources and sinks, respectively.
Structured Streaming supports two stream execution engines (i.e. Micro-Batch and
Continuous) with their own APIs.
Micro-Batch Stream Processing supports the old Data Source API V1 and the new modern
Data Source API V2 with micro-batch-specific APIs for streaming sources and sinks.
Continuous Stream Processing supports the new modern Data Source API V2 only with
continuous-specific APIs for streaming sources and sinks.
The following are the questions to think of (and answer) while considering development of a
new data source for Structured Streaming. They are supposed to give you a sense of how
much work and time it takes as well as what Spark version to support (e.g. 2.2 vs 2.4).
BaseStreamingSource Contract — Base of
Streaming Readers and Sources
BaseStreamingSource is the abstraction of streaming readers and sources that can be
stopped.
void stop()
BaseStreamingSink Contract — Base of
Streaming Writers and Sinks
BaseStreamingSink is the abstraction of streaming writers and sinks with the only purpose of
sharing a common abstraction between the former Data Source API V1 (Sink API) and the
modern Data Source API V2 (until Spark Structured Streaming migrates to the Data Source
API V2 fully).
StreamWriteSupport Contract — Writable
Streaming Data Sources
StreamWriteSupport is the abstraction of DataSourceV2 sinks that create StreamWriters for streaming writes:
StreamWriter createStreamWriter(
String queryId,
StructType schema,
OutputMode mode,
DataSourceOptions options)
createStreamWriter creates a StreamWriter for a streaming write and is used when the stream execution thread for a streaming query is started and requests the stream execution engines to start, i.e. MicroBatchExecution and ContinuousExecution.
Table 1. StreamWriteSupports
ConsoleSinkProvider: streaming sink for the console data source format
ForeachWriterProvider: streaming sink for DataStreamWriter.foreach
KafkaSourceProvider: streaming sink for the kafka data source format
MemorySinkV2: streaming sink for the memory data source format
StreamWriter Contract
StreamWriter is the extension of the DataSourceWriter contract to support epochs, i.e.
streaming writers that can abort and commit writing jobs for a specified epoch.
void abort(
long epochId,
WriterCommitMessage[] messages)
abort
void commit(
long epochId,
WriterCommitMessage[] messages)
commit
Commits the writing job for a specified epochId and
WriterCommitMessages
Used when:
EpochCoordinator is requested to commitEpoch
Table 2. StreamWriters
StreamWriter Description
DataSource
SparkSession
className , i.e. the fully-qualified class name or an alias of the data source
sourceSchema(): SourceInfo
sourceSchema creates a new instance of the data source class and branches off per the
StreamSourceProvider
In the end, sourceSchema returns the name and the schema as part of SourceInfo (with
partition columns unspecified).
FileFormat
For a FileFormat , sourceSchema …FIXME
Other Types
For any other data source type, sourceSchema simply throws an
UnsupportedOperationException :
createSource(
metadataPath: String): Source
createSource creates a new instance of the data source class and branches off per the
StreamSourceProvider
For a StreamSourceProvider, createSource requests the StreamSourceProvider to create a
source.
FileFormat
For a FileFormat , createSource creates a new FileStreamSource.
createSource throws an IllegalArgumentException when path option was not specified for
Other Types
For any other data source type, createSource simply throws an
UnsupportedOperationException :
createSink(
outputMode: OutputMode): Sink
Internally, createSink creates a new instance of the providingClass and branches off per
type:
Internal Properties
providingClass: java.lang.Class for the className (that can be a fully-qualified class name or an alias of the data source)
sourceInfo: SourceInfo
Demos
1. Demo: Internals of FlatMapGroupsWithStateExec Physical Operator
4. Demo: Streaming Query for Running Counts (Socket Source and Complete Output
Mode)
9. Developing Custom Streaming Sink (and Monitoring SQL Queries in web UI)
Demo: Internals of
FlatMapGroupsWithStateExec Physical
Operator
The following demo shows the internals of the FlatMapGroupsWithStateExec physical operator in an Arbitrary Stateful Streaming Aggregation.
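The values streaming Dataset used below could be defined as follows (a sketch; a MemoryStream of (Timestamp, Long) pairs is assumed, which matches the schema and the MemoryStream-based physical plan shown later):
import java.sql.Timestamp
import org.apache.spark.sql.execution.streaming.MemoryStream
implicit val sqlCtx = spark.sqlContext
val eventStream = MemoryStream[(Timestamp, Long)]
val values = eventStream.toDS.toDF("time", "value")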
values.printSchema
/**
root
|-- time: timestamp (nullable = true)
|-- value: long (nullable = false)
*/
import scala.concurrent.duration._
val delayThreshold = 10.seconds
val valuesWatermarked = values
.withWatermark(eventTime = "time", delayThreshold.toString) // required for EventTimeTimeout
import java.sql.Timestamp
import org.apache.spark.sql.streaming.GroupState
val keyCounts = (key: Long, values: Iterator[(Timestamp, Long)], state: GroupState[Cou
nt]) => {
println(s""">>> keyCounts(key = $key, state = ${state.getOption.getOrElse("<empty>")
})""")
println(s">>> >>> currentProcessingTimeMs: ${state.getCurrentProcessingTimeMs}")
println(s">>> >>> currentWatermarkMs: ${state.getCurrentWatermarkMs}")
println(s">>> >>> hasTimedOut: ${state.hasTimedOut}")
val count = Count(values.length)
Iterator((key, count))
}
valuesCounted.explain
/**
== Physical Plan ==
*(2) Project [_1#928L AS value#931L, _2#929 AS count#932]
+- *(2) SerializeFromObject [assertnotnull(input[0, scala.Tuple2, true])._1 AS _1#928L
, if (isnull(assertnotnull(input[0, scala.Tuple2, true])._2)) null else named_struct(v
alue, assertnotnull(assertnotnull(input[0, scala.Tuple2, true])._2).value) AS _2#929]
+- FlatMapGroupsWithState $line140.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$i
w$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$
$iw$$iw$$iw$$iw$$Lambda$4117/181063008@d2cdc82, value#923: bigint, newInstance(class s
cala.Tuple2), [value#923L], [time#915-T10000ms, value#916L], obj#927: scala.Tuple2, st
ate info [ checkpoint = <unknown>, runId = 9af3d00c-fe1f-46a0-8630-4e0d0af88042, opId
= 0, ver = 0, numPartitions = 1], class[value[0]: bigint], 2, Update, EventTimeTimeout
, 0, 0
+- *(1) Sort [value#923L ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(value#923L, 1)
+- AppendColumns $line140.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$i
w$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$
$iw$$iw$$iw$$iw$$Lambda$4118/2131767153@3e606b4c, newInstance(class scala.Tuple2), [in
put[0, bigint, false] AS value#923L]
+- EventTimeWatermark time#915: timestamp, interval 10 seconds
+- StreamingRelation MemoryStream[time#915,value#916L], [time#915, v
alue#916L]
*/
import org.apache.spark.sql.streaming.OutputMode.Update
val streamingQuery = valuesCounted
.writeStream
.format("memory")
.queryName(queryName)
.option("checkpointLocation", checkpointLocation)
.outputMode(Update)
.start
/**
>>> keyCounts(key = 1, state = <empty>)
>>> >>> currentProcessingTimeMs: 1561881557237
>>> >>> currentWatermarkMs: 0
>>> >>> hasTimedOut: false
>>> keyCounts(key = 2, state = <empty>)
>>> >>> currentProcessingTimeMs: 1561881557237
>>> >>> currentWatermarkMs: 0
>>> >>> hasTimedOut: false
*/
spark.table(queryName).show(truncate = false)
/**
+-----+-----+
|value|count|
+-----+-----+
|1 |[1] |
|2 |[1] |
+-----+-----+
*/
import org.apache.spark.sql.execution.streaming.StreamingQueryWrapper
val engine = streamingQuery
  .asInstanceOf[StreamingQueryWrapper]
  .streamingQuery
import org.apache.spark.sql.execution.streaming.IncrementalExecution
val lastMicroBatch: IncrementalExecution = engine.lastExecution
// Access executedPlan that is the optimized physical query plan ready for execution
// All streaming optimizations have been applied at this point
val plan = lastMicroBatch.executedPlan
// Display metrics
import org.apache.spark.sql.execution.metric.SQLMetric
def formatMetrics(name: String, metric: SQLMetric) = {
val desc = metric.name.getOrElse("")
val value = metric.value
f"| $name%-30s | $desc%-69s | $value%-10s"
}
// Find the FlatMapGroupsWithStateExec physical operator in the executed plan
import org.apache.spark.sql.execution.streaming.FlatMapGroupsWithStateExec
val flatMapOp = plan.collectFirst { case op: FlatMapGroupsWithStateExec => op }.get
flatMapOp.metrics.map { case (name, metric) => formatMetrics(name, metric) }.foreach(println)
/**
| numTotalStateRows              | number of total state rows                                             | 0
| stateMemory                    | memory used by state total (min, med, max)                             | 390
| loadedMapCacheHitCount         | count of cache hit on states cache in provider                         | 1
| numOutputRows                  | number of output rows                                                  | 0
| stateOnCurrentVersionSizeBytes | estimated size of state only on current version total (min, med, max)  | 102
| loadedMapCacheMissCount        | count of cache miss on states cache in provider                        | 0
| commitTimeMs                   | time to commit changes total (min, med, max)                           | -2
| allRemovalsTimeMs              | total time to remove rows total (min, med, max)                        | -2
| numUpdatedStateRows            | number of updated state rows                                           | 0
| allUpdatesTimeMs               | total time to update rows total (min, med, max)                        | -2
*/
spark.table(queryName).show(truncate = false)
/**
+-----+-----+
|value|count|
+-----+-----+
|1 |[1] |
|2 |[1] |
|3 |[1] |
+-----+-----+
*/
/**
>>> keyCounts(key = 3, state = <empty>)
>>> >>> currentProcessingTimeMs: 1561881672887
>>> >>> currentWatermarkMs: 5000
>>> >>> hasTimedOut: false
*/
spark.table(queryName).show(truncate = false)
/**
+-----+-----+
|value|count|
+-----+-----+
|1 |[1] |
|2 |[1] |
|3 |[1] |
|3 |[1] |
+-----+-----+
*/
/**
>>> keyCounts(key = 3, state = <empty>)
>>> >>> currentProcessingTimeMs: 1561881778165
>>> >>> currentWatermarkMs: 7000
>>> >>> hasTimedOut: false
*/
// Eventually...
streamingQuery.stop()
Arbitrary Stateful Streaming Aggregation with
KeyValueGroupedDataset.flatMapGroupsWithState Operator
import java.sql.Timestamp
type DeviceId = Long
case class Signal(timestamp: Timestamp, deviceId: DeviceId, value: Long)
// input stream
import org.apache.spark.sql.functions._
val signals = spark
.readStream
.format("rate")
.option("rowsPerSecond", 1)
.load
.withColumn("deviceId", rint(rand() * 10) cast "int") // 10 devices randomly assigne
d to values
.withColumn("value", $"value" % 10) // randomize the values (just for fun)
.as[Signal] // convert to our type (from "unpleasant" Row)
import org.apache.spark.sql.streaming.GroupState
type Key = Int
type Count = Long
type State = Map[Key, Count]
case class EventsCounted(deviceId: DeviceId, count: Long)
def countValuesPerDevice(
deviceId: Int,
signals: Iterator[Signal],
state: GroupState[State]): Iterator[EventsCounted] = {
val values = signals.toSeq
println(s"Device: $deviceId")
println(s"Signals (${values.size}):")
values.zipWithIndex.foreach { case (v, idx) => println(s"$idx. $v") }
println(s"State: $state")
// update the state with the count of elements for the key
val initialState: State = Map(deviceId -> 0)
val oldState = state.getOption.getOrElse(initialState)
// the name to highlight that the state is for the key only
val newValue = oldState(deviceId) + values.size
val newState = Map(deviceId -> newValue)
state.update(newState)
// emit the updated count for the key
Iterator(EventsCounted(deviceId, newValue))
}
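The function could then be wired into a streaming query with KeyValueGroupedDataset.flatMapGroupsWithState (a sketch; spark.implicits._ in scope as in spark-shell, Update output mode and no timeout are assumptions):
import org.apache.spark.sql.streaming.{GroupStateTimeout, OutputMode}
val countsPerDevice = signals
  .groupByKey(signal => signal.deviceId.toInt)
  .flatMapGroupsWithState(OutputMode.Update, GroupStateTimeout.NoTimeout)(countValuesPerDevice)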
Exploring Checkpointed State
The demo uses the state checkpoint directory that was used in Demo: Streaming Watermark
with Aggregation in Append Output Mode.
import org.apache.spark.sql.execution.streaming.state.StateStoreId
val storeId = StateStoreId(
checkpointRootLocation,
operatorId = 0,
partitionId = 0)
// The key and value schemas should match the watermark demo
// .groupBy(window($"time", windowDuration.toString) as "sliding_window")
import org.apache.spark.sql.types.{TimestampType, StructField, StructType}
val keySchema = StructType(
StructField("sliding_window",
StructType(
StructField("start", TimestampType, nullable = true) ::
StructField("end", TimestampType, nullable = true) :: Nil),
nullable = false) :: Nil)
scala> keySchema.printTreeString
root
|-- sliding_window: struct (nullable = false)
| |-- start: timestamp (nullable = true)
| |-- end: timestamp (nullable = true)
import org.apache.spark.sql.execution.streaming.state.StateStoreProvider
val provider = StateStoreProvider.createAndInit(
storeId, keySchema, valueSchema, indexOrdinal, storeConf, hadoopConf)
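With the provider created, a versioned state store can be loaded from it (a sketch; version 1, i.e. the state after the first micro-batch, is an assumption):
val store = provider.getStore(1)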
import org.apache.spark.sql.execution.streaming.state.UnsafeRowPair
def formatRowPair(rowPair: UnsafeRowPair) = {
s"(${rowPair.key.getLong(0)}, ${rowPair.value.getLong(0)})"
}
store.iterator.map(formatRowPair).foreach(println)
Streaming Watermark with Aggregation in Append Output Mode
The demo also shows the behaviour and the internals of StateStoreSaveExec physical
operator in Append output mode.
values.printSchema
/**
root
|-- time: timestamp (nullable = true)
|-- value: long (nullable = false)
|-- batch: long (nullable = false)
*/
import scala.concurrent.duration._
val delayThreshold = 10.seconds
val eventTime = "time"
countsPer5secWindow.printSchema
/**
root
|-- sliding_window: struct (nullable = false)
| |-- start: timestamp (nullable = true)
| |-- end: timestamp (nullable = true)
|-- batches: array (nullable = true)
| |-- element: long (containsNull = true)
|-- values: array (nullable = true)
| |-- element: long (containsNull = true)
*/
.option("checkpointLocation", checkpointLocation)
.outputMode(OutputMode.Append) // <-- Use Append output mode
.start
println(streamingQuery.lastProgress.stateOperators(0).prettyJson)
/**
{
"numRowsTotal" : 1,
"numRowsUpdated" : 0,
"memoryUsedBytes" : 1102,
"customMetrics" : {
"loadedMapCacheHitCount" : 2,
"loadedMapCacheMissCount" : 0,
"stateOnCurrentVersionSizeBytes" : 414
}
}
*/
import org.apache.spark.sql.execution.streaming.StreamingQueryWrapper
val engine = streamingQuery
.asInstanceOf[StreamingQueryWrapper]
.streamingQuery
import org.apache.spark.sql.execution.streaming.StreamExecution
assert(engine.isInstanceOf[StreamExecution])
// Access executedPlan that is the optimized physical query plan ready for execution
// All streaming optimizations have been applied at this point
// We just need the EventTimeWatermarkExec physical operator
import org.apache.spark.sql.execution.streaming.IncrementalExecution
val lastMicroBatch: IncrementalExecution = engine.lastExecution
val plan = lastMicroBatch.executedPlan
println(stats)
/**
EventTimeStats(-9223372036854775808,9223372036854775807,0.0,0)
*/
println(stats)
/**
EventTimeStats(-9223372036854775808,9223372036854775807,0.0,0)
*/
println(stats)
/**
EventTimeStats(26000,15000,19000.0,4)
*/
println(stats)
/**
EventTimeStats(-9223372036854775808,9223372036854775807,0.0,0)
*/
|sliding_window |batches|values|
+------------------------------------------+-------+------+
|[1970-01-01 01:00:00, 1970-01-01 01:00:05]|[1] |[1] |
|[1970-01-01 01:00:15, 1970-01-01 01:00:20]|[1, 2] |[2, 2]|
|[1970-01-01 01:00:25, 1970-01-01 01:00:30]|[3] |[4] |
|[1970-01-01 01:00:35, 1970-01-01 01:00:40]|[2, 4] |[3, 1]|
+------------------------------------------+-------+------+
*/
println(stats)
/**
EventTimeStats(-9223372036854775808,9223372036854775807,0.0,0)
*/
// Eventually...
streamingQuery.stop()
Streaming Query for Running Counts (Socket Source and Complete Output Mode)
assert(spark.sessionState.conf.numShufflePartitions == numShufflePartitions)
// END: Only for easier debugging
scala> lines.printSchema
root
|-- value: string (nullable = true)
import org.apache.spark.sql.functions.explode
val words = lines
.select(explode(split($"value", """\W+""")) as "word")
scala> counts.printSchema
root
|-- word: string (nullable = true)
|-- count: long (nullable = false)
import org.apache.spark.sql.streaming.OutputMode.Complete
val runningCounts = counts
.writeStream
.format("console")
.option("checkpointLocation", checkpointLocation)
.outputMode(Complete)
.start
scala> runningCounts.explain
== Physical Plan ==
WriteToDataSourceV2 org.apache.spark.sql.execution.streaming.sources.MicroBatchWriter@205f195c
+- *(5) HashAggregate(keys=[word#72], functions=[count(1)])
   +- StateStoreSave [word#72], state info [ checkpoint = file:/tmp/checkpoint-running_counts/state, runId = f3b2e642-1790-4a17-ab61-3d894110b063, opId = 0, ver = 0, numPartitions = 1], Complete, 0, 2
      +- *(4) HashAggregate(keys=[word#72], functions=[merge_count(1)])
         +- StateStoreRestore [word#72], state info [ checkpoint = file:/tmp/checkpoint-running_counts/state, runId = f3b2e642-1790-4a17-ab61-3d894110b063, opId = 0, ver = 0, numPartitions = 1], 2
            +- *(3) HashAggregate(keys=[word#72], functions=[merge_count(1)])
               +- Exchange hashpartitioning(word#72, 1)
                  +- *(2) HashAggregate(keys=[word#72], functions=[partial_count(1)])
                     +- Generate explode(split(value#83, \W+)), false, [word#72]
                        +- *(1) Project [value#83]
                           +- *(1) ScanV2 socket[value#83] (Options: [host=localhost,port=9999])
// in /tmp/checkpoint-running_counts/state/0/0
// Eventually...
runningCounts.stop()
Streaming Aggregation with Kafka Data Source
Tip: You may want to consider copying the following code to append.txt and using the :load append.txt command in spark-shell to load it (rather than copying and pasting it).
assert(spark.sessionState.conf.numShufflePartitions == numShufflePartitions)
// END: Only for easier debugging
as to be a timestamp
.withColumn("id", 'tokens(1))
.withColumn("batch", 'tokens(2) cast "int")
.withWatermark(eventTime = "event_time", delayThreshold = "10 seconds") // <-- define watermark (before groupBy!)
.groupBy($"event_time") // <-- use event_time for grouping
.agg(collect_list("batch") as "batches", collect_list("id") as "ids")
.withColumn("event_time", to_timestamp($"event_time")) // <-- convert to human-readable date
scala> ids.printSchema
root
|-- event_time: timestamp (nullable = true)
|-- batches: array (nullable = true)
| |-- element: integer (containsNull = true)
|-- ids: array (nullable = true)
| |-- element: string (containsNull = true)
// ids knows nothing about the output mode or the current streaming watermark yet
// - Output mode is defined on writing side
// - streaming watermark is read from rows at runtime
// That's why StatefulOperatorStateInfo is generic (and uses the default Append for output mode)
// and no batch-specific values are printed out
// They will be available right after the first streaming batch
// Use explain on a streaming query to know the trigger-specific values
scala> ids.explain
== Physical Plan ==
ObjectHashAggregate(keys=[event_time#118-T10000ms], functions=[collect_list(batch#141,
0, 0), collect_list(id#129, 0, 0)])
+- StateStoreSave [event_time#118-T10000ms], state info [ checkpoint = <unknown>, runI
d = a870e6e2-b925-4104-9886-b211c0be1b73, opId = 0, ver = 0, numPartitions = 1], Append
, 0, 2
+- ObjectHashAggregate(keys=[event_time#118-T10000ms], functions=[merge_collect_lis
t(batch#141, 0, 0), merge_collect_list(id#129, 0, 0)])
+- StateStoreRestore [event_time#118-T10000ms], state info [ checkpoint = <unkno
wn>, runId = a870e6e2-b925-4104-9886-b211c0be1b73, opId = 0, ver = 0, numPartitions = 1
], 2
+- ObjectHashAggregate(keys=[event_time#118-T10000ms], functions=[merge_colle
ct_list(batch#141, 0, 0), merge_collect_list(id#129, 0, 0)])
+- Exchange hashpartitioning(event_time#118-T10000ms, 1)
+- ObjectHashAggregate(keys=[event_time#118-T10000ms], functions=[parti
al_collect_list(batch#141, 0, 0), partial_collect_list(id#129, 0, 0)])
+- EventTimeWatermark event_time#118: timestamp, interval 10 seconds
+- *(1) Project [cast(from_unixtime(cast(split(cast(value#8 as st
ring), ,)[0] as bigint), yyyy-MM-dd HH:mm:ss, Some(Europe/Warsaw)) as timestamp) AS ev
ent_time#118, split(cast(value#8 as string), ,)[1] AS id#129, cast(split(cast(value#8
as string), ,)[2] as int) AS batch#141]
+- StreamingRelation kafka, [key#7, value#8, topic#9, partitio
n#10, offset#11L, timestamp#12, timestampType#13]
scala> println(lastProgress.stateOperators.head.prettyJson)
{
"numRowsTotal" : 1,
"numRowsUpdated" : 0,
"memoryUsedBytes" : 742,
"customMetrics" : {
"loadedMapCacheHitCount" : 1,
"loadedMapCacheMissCount" : 1,
"stateOnCurrentVersionSizeBytes" : 374
}
}
// Eventually...
streamingQuery.stop()
groupByKey Streaming Aggregation in Update Mode
package pl.japila.spark.examples
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.{OutputMode, Trigger}
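// The input below is a sketch (assumptions: a SparkSession available as spark with
// spark.implicits._ imported, and orders derived from the rate data source)
case class Order(id: Long, zipCode: String)
val orders = spark.readStream
  .format("rate")
  .load
  .select($"value" as "id", ($"value" % 100).cast("string") as "zipCode")
  .as[Order]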
// Processing logic
// groupByKey + count
val byZipCode = (o: Order) => o.zipCode
val ordersByZipCode = orders.groupByKey(byZipCode)
import org.apache.spark.sql.functions.count
val typedCountCol = (count("zipCode") as "count").as[String]
val counts = ordersByZipCode
.agg(typedCountCol)
.select($"value" as "zip_code", $"count")
Credits
The example with customer orders and postal codes is borrowed from Apache Beam’s
Using GroupByKey Programming Guide.
StateStoreSaveExec with Complete Output Mode
+- ObjectHashAggregate(keys=[group#25], functions=[merge_collect_list(valu
e#36, 0, 0)])
+- Exchange hashpartitioning(group#25, 1)
+- StateStoreRestore [group#25], StatefulOperatorStateInfo(<unknown>,
899f0fd1-b202-45cd-9ebd-09101ca90fa8,0,0)
+- ObjectHashAggregate(keys=[group#25], functions=[merge_collect_
list(value#36, 0, 0)])
+- Exchange hashpartitioning(group#25, 1)
+- ObjectHashAggregate(keys=[group#25], functions=[partial_
collect_list(value#36, 0, 0)])
+- *Project [split(cast(value#1 as string), ,)[0] AS gro
up#25, cast(split(cast(value#1 as string), ,)[1] as int) AS value#36]
+- StreamingRelation kafka, [key#0, value#1, topic#2,
partition#3, offset#4L, timestamp#5, timestampType#6]
-------------------------------------------
Batch: 0
-------------------------------------------
+-----+------+
|group|values|
+-----+------+
+-----+------+
// there's only 1 stateful operator and hence 0 for the index in stateOperators
scala> println(sq.lastProgress.stateOperators(0).prettyJson)
{
"numRowsTotal" : 0,
"numRowsUpdated" : 0,
"memoryUsedBytes" : 60
}
-------------------------------------------
Batch: 1
-------------------------------------------
+-----+------+
|group|values|
+-----+------+
|0 |[1] |
+-----+------+
// it's Complete output mode so numRowsTotal is the number of keys in the state store
// no keys were available earlier (it's just started!) and so numRowsUpdated is 0
scala> println(sq.lastProgress.stateOperators(0).prettyJson)
{
"numRowsTotal" : 1,
"numRowsUpdated" : 0,
"memoryUsedBytes" : 324
}
-------------------------------------------
Batch: 2
-------------------------------------------
+-----+------+
|group|values|
+-----+------+
|0 |[2, 1]|
|1 |[1] |
+-----+------+
// it's Complete output mode so numRowsTotal is the number of keys in the state store
// no keys were available earlier and so numRowsUpdated is...0?!
// Think it's a BUG as it should've been 1 (for the row 0,2)
// 8/30 Sent out a question to the Spark user mailing list
scala> println(sq.lastProgress.stateOperators(0).prettyJson)
{
"numRowsTotal" : 2,
"numRowsUpdated" : 0,
"memoryUsedBytes" : 572
}
// In the end...
sq.stop
StateStoreSaveExec with Update Output Mode
Developing Custom Streaming Sink (and Monitoring SQL Queries in web UI)
Note: The main motivation was to answer the question Why does a single structured query run multiple SQL queries per batch? that turned out to be fairly surprising. You're very welcome to upvote the question and answers at your earliest convenience. Thanks!
2. Creating StreamSinkProvider — DemoSinkProvider
4. build.sbt Definition
5. Packaging DemoSink
1. Custom sinks require that you define a checkpoint location using checkpointLocation
option (or spark.sql.streaming.checkpointLocation Spark property). Remove the
checkpoint directory (or use a different one every start of a streaming query) to have
consistent results.
package pl.japila.spark.sql.streaming
Creating StreamSinkProvider — DemoSinkProvider
package pl.japila.spark.sql.streaming
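// What the two classes could look like (a sketch: only the StreamSinkProvider, Sink and
// DataSourceRegister contracts, the class names and the "demo" short name come from the text;
// the method bodies are assumptions)
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.execution.streaming.Sink
import org.apache.spark.sql.sources.{DataSourceRegister, StreamSinkProvider}
import org.apache.spark.sql.streaming.OutputMode

class DemoSink extends Sink {
  // Collect the rows of every micro-batch and print them out
  override def addBatch(batchId: Long, data: DataFrame): Unit = {
    println(s"DemoSink.addBatch(batchId = $batchId)")
    data.collect().foreach(println)
  }
}

class DemoSinkProvider extends StreamSinkProvider with DataSourceRegister {
  override def createSink(
      sqlContext: SQLContext,
      parameters: Map[String, String],
      partitionColumns: Seq[String],
      outputMode: OutputMode): Sink = new DemoSink

  // Registers the sink under the "demo" format name
  override def shortName(): String = "demo"
}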
To make the demo format name resolvable, the provider is registered in the META-INF/services/org.apache.spark.sql.sources.DataSourceRegister file with the following single line:
pl.japila.spark.sql.streaming.DemoSinkProvider
build.sbt Definition
If you use my beloved build tool sbt to manage the project, use the following build.sbt .
organization := "pl.japila.spark"
name := "spark-structured-streaming-demo-sink"
version := "0.1"
scalaVersion := "2.11.11"
// The Spark SQL dependency the sink compiles against (the version is an assumption)
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.0" % "provided"
Packaging DemoSink
The step depends on what build tool you use to manage the project. Use whatever
command you use to create a jar file with the above classes compiled and bundled together.
$ sbt package
[info] Loading settings from plugins.sbt ...
[info] Loading project definition from /Users/jacek/dev/sandbox/spark-structured-strea
ming-demo-sink/project
[info] Loading settings from build.sbt ...
[info] Set current project to spark-structured-streaming-demo-sink (in build file:/Use
rs/jacek/dev/sandbox/spark-structured-streaming-demo-sink/)
[info] Compiling 1 Scala source to /Users/jacek/dev/sandbox/spark-structured-streaming
-demo-sink/target/scala-2.11/classes ...
[info] Done compiling.
[info] Packaging /Users/jacek/dev/sandbox/spark-structured-streaming-demo-sink/target/
scala-2.11/spark-structured-streaming-demo-sink_2.11-0.1.jar ...
[info] Done packaging.
[success] Total time: 5 s, completed Sep 12, 2017 9:34:19 AM
The following code reads data from the rate source and simply outputs the result to our
custom DemoSink .
scala> spark.version
res0: String = 2.3.0-SNAPSHOT
import org.apache.spark.sql.streaming._
import scala.concurrent.duration._
val sq = spark.
readStream.
format("rate").
load.
writeStream.
format("demo").
option("checkpointLocation", "/tmp/demo-checkpoint").
trigger(Trigger.ProcessingTime(10.seconds)).
start
// In the end...
scala> sq.stop
17/09/12 09:59:28 INFO StreamExecution: Query [id = 03cd78e3-94e2-439c-9c12-cfed0c996812, runId = 6938af91-9806-4404-965a-5ae7525d5d3f] was stopped
You should find that every trigger (aka batch) results in 3 SQL queries. Why?
Figure 1. web UI’s SQL Tab and Completed Queries (3 Queries per Batch)
The answer lies in what sources and sink a streaming query uses (and differs per streaming
query).
In our case, DemoSink collects the rows from the input DataFrame and shows them afterwards. That gives 2 SQL queries (as you can see after executing the following batch queries).
The remaining query (which is the first among the queries) is executed when you load the
data.
That can be observed easily when you change DemoSink to not "touch" the input data (in
addBatch ) in any way.
Re-run the streaming query (using the new DemoSink ) and use web UI’s SQL tab to see the
queries. You should have just one query per batch (and no Spark jobs given nothing is really
done in the sink’s addBatch ).
Figure 2. web UI’s SQL Tab and Completed Queries (1 Query per Batch)
current_timestamp Function For Processing Time in Streaming Queries
Note: The main motivation was to answer the question How to achieve ingestion time? in Spark Structured Streaming. You're very welcome to upvote the question and answers at your earliest convenience. Thanks!
Event time is the time that each individual event occurred on its producing device. This
time is typically embedded within the records before they enter Flink and that event
timestamp can be extracted from the record.
That is exactly how event time is considered in withWatermark operator which you use to
describe what column to use for event time. The column could be part of the input dataset
or…generated.
In order to generate the event time column for withWatermark operator you could use
current_timestamp or current_date standard functions.
Both are special for Spark Structured Streaming as StreamExecution replaces their
underlying Catalyst expressions, CurrentTimestamp and CurrentDate respectively, with
CurrentBatchTimestamp expression and the time of the current batch.
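A sketch of how the values streaming Dataset used below could be defined (the rate source is an assumption):
import org.apache.spark.sql.functions.current_timestamp
val values = spark.readStream
  .format("rate")
  .load
  .withColumn("current_timestamp", current_timestamp())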
import org.apache.spark.sql.streaming.Trigger
import scala.concurrent.duration._
val sq = values.
writeStream.
format("console").
option("truncate", false).
trigger(Trigger.ProcessingTime(10.seconds)).
start
-------------------------------------------
Batch: 1
-------------------------------------------
+-----------------------+-----+-------------------+
|timestamp |value|current_timestamp |
+-----------------------+-----+-------------------+
|2017-09-18 10:53:31.523|0 |2017-09-18 10:53:40|
|2017-09-18 10:53:32.523|1 |2017-09-18 10:53:40|
|2017-09-18 10:53:33.523|2 |2017-09-18 10:53:40|
|2017-09-18 10:53:34.523|3 |2017-09-18 10:53:40|
|2017-09-18 10:53:35.523|4 |2017-09-18 10:53:40|
|2017-09-18 10:53:36.523|5 |2017-09-18 10:53:40|
|2017-09-18 10:53:37.523|6 |2017-09-18 10:53:40|
|2017-09-18 10:53:38.523|7 |2017-09-18 10:53:40|
+-----------------------+-----+-------------------+
// Use web UI's SQL tab for the batch (Submitted column)
// or sq.recentProgress
scala> println(sq.recentProgress(1).timestamp)
2017-09-18T08:53:40.000Z
// Note current_batch_timestamp
== Physical Plan ==
That seems to be closer to processing time than ingestion time given the definition from the
Apache Flink documentation:
Processing time refers to the system time of the machine that is executing the
respective operation.
Using StreamingQueryManager for Query Termination Management
demo-StreamingQueryManager.scala
import org.apache.spark.sql.streaming._
import scala.concurrent.duration._
start
/*
-------------------------------------------
Batch: 7
-------------------------------------------
+-----------------------+-----+
|timestamp |value|
+-----------------------+-----+
|2017-10-27 13:44:07.462|21 |
|2017-10-27 13:44:08.462|22 |
|2017-10-27 13:44:09.462|23 |
|2017-10-27 13:44:10.462|24 |
+-----------------------+-----+
-------------------------------------------
Batch: 8
-------------------------------------------
+-----------------------+-----+
|timestamp |value|
+-----------------------+-----+
|2017-10-27 13:44:11.462|25 |
|2017-10-27 13:44:12.462|26 |
|2017-10-27 13:44:13.462|27 |
|2017-10-27 13:44:14.462|28 |
+-----------------------+-----+
-------------------------------------------
Batch: 2
-------------------------------------------
+-----------------------+-----+
|timestamp |value|
+-----------------------+-----+
|2017-10-27 13:44:09.847|6 |
|2017-10-27 13:44:10.847|7 |
|2017-10-27 13:44:11.847|8 |
|2017-10-27 13:44:12.847|9 |
|2017-10-27 13:44:13.847|10 |
|2017-10-27 13:44:14.847|11 |
|2017-10-27 13:44:15.847|12 |
|2017-10-27 13:44:16.847|13 |
|2017-10-27 13:44:17.847|14 |
|2017-10-27 13:44:18.847|15 |
+-----------------------+-----+
*/
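The code below references two streaming queries, q4s and q10s, whose definitions fall outside
this excerpt. A minimal sketch of how they could have been defined (assuming rate-source
console queries with 4-second and 10-second processing-time triggers; purely illustrative):

import org.apache.spark.sql.streaming.Trigger
import scala.concurrent.duration._
val q4s = spark.readStream.format("rate").load.
  writeStream.
  format("console").
  trigger(Trigger.ProcessingTime(4.seconds)).
  start
val q10s = spark.readStream.format("rate").load.
  writeStream.
  format("console").
  trigger(Trigger.ProcessingTime(10.seconds)).
  start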
import java.util.concurrent.Executors
import java.util.concurrent.TimeUnit.SECONDS
def queryTerminator(query: StreamingQuery) = new Runnable {
def run = {
println(s"Stopping streaming query: ${query.id}")
query.stop
}
}
// Stop the first query after 10 seconds
Executors.newSingleThreadScheduledExecutor.
scheduleWithFixedDelay(queryTerminator(q4s), 10, 60 * 5, SECONDS)
// Stop the other query after 20 seconds
Executors.newSingleThreadScheduledExecutor.
scheduleWithFixedDelay(queryTerminator(q10s), 20, 60 * 5, SECONDS)
spark.streams.awaitAnyTermination
// You are here only after either streaming query has finished
// Executing spark.streams.awaitAnyTermination again would return immediately
// You should have received the QueryTerminatedEvent for the query termination
assert(spark.streams.active.isEmpty)
// leave spark-shell
System.exit(0)
Streaming Aggregation
In Spark Structured Streaming, a streaming aggregation is a streaming query that was
described (built) using the high-level streaming operators groupBy and groupByKey
(as well as SQL's GROUP BY clause).
IncrementalExecution — QueryExecution of Streaming
Queries
Under the covers, the high-level operators create a logical query plan with one or more
Aggregate logical operators.
Tip Read up on Aggregate logical operator in The Internals of Spark SQL book.
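The counts streaming query below is not defined in this excerpt; a minimal sketch that matches
the physical plan that follows (a rate source grouped by value % 2 with a count aggregation)
could be:

import spark.implicits._
val counts = spark
  .readStream
  .format("rate")
  .load
  .groupBy($"value" % 2)   // grouping expression shows up as (value % 2) in the plan
  .count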
counts.explain(extended = true)
/**
== Parsed Logical Plan ==
== Physical Plan ==
*(4) HashAggregate(keys=[(value#16L % 2)#27L], functions=[count(1)], output=[(value %
2)#23L, count#22L])
+- StateStoreSave [(value#16L % 2)#27L], state info [ checkpoint = <unknown>, runId =
8c0ae2be-5eaa-4038-bc29-a176abfaf885, opId = 0, ver = 0, numPartitions = 200], Append,
0, 2
+- *(3) HashAggregate(keys=[(value#16L % 2)#27L], functions=[merge_count(1)], outpu
t=[(value#16L % 2)#27L, count#29L])
+- StateStoreRestore [(value#16L % 2)#27L], state info [ checkpoint = <unknown>,
runId = 8c0ae2be-5eaa-4038-bc29-a176abfaf885, opId = 0, ver = 0, numPartitions = 200]
, 2
+- *(2) HashAggregate(keys=[(value#16L % 2)#27L], functions=[merge_count(1)],
output=[(value#16L % 2)#27L, count#29L])
+- Exchange hashpartitioning((value#16L % 2)#27L, 200)
+- *(1) HashAggregate(keys=[(value#16L % 2) AS (value#16L % 2)#27L], fu
nctions=[partial_count(1)], output=[(value#16L % 2)#27L, count#29L])
+- *(1) Project [value#16L]
+- StreamingRelation rate, [timestamp#15, value#16L]
*/
Demos
Use the following demos to learn more:
Streaming Query for Running Counts (Socket Source and Complete Output Mode)
StateStoreRDD
StateStoreRDD is created when the following stateful physical operators are executed (using
StateStoreOps.mapPartitionsWithStateStore):
FlatMapGroupsWithStateExec
StateStoreRestoreExec
StateStoreSaveExec
StreamingDeduplicateExec
StreamingGlobalLimitExec
StateStoreRDD uses StateStoreCoordinator for the preferred locations of a partition for job
scheduling.
compute(
partition: Partition,
ctxt: TaskContext): Iterator[U]
compute is part of the RDD contract to compute a given partition (using the
configured StateStore).
compute creates a StateStoreProviderId for the partition (a StateStoreId with the
checkpointLocation, operatorId and the index of the input partition , together with the
queryRunId).
compute then requests StateStore for the store for the StateStoreProviderId.
In the end, compute computes dataRDD (using the input partition and ctxt ) followed by
executing storeUpdateFunction (with the store and the result).
StateStoreRDD takes the following (among others) to be created:
Checkpoint directory
Operator ID
Index
SessionState
Optional StateStoreCoordinatorRef
Internal Properties
Name Description
hadoopConfBroadcast
storeConf
Configuration parameters (as StateStoreConf ) using the
current SQLConf (from SessionState )
StateStoreOps
FlatMapGroupsWithStateExec
StateStoreRestoreExec
StateStoreSaveExec
StreamingDeduplicateExec
Note: Implicit Classes are a language feature in Scala for implicit conversions with
extension methods for existing types.
mapPartitionsWithStateStore[U](
stateInfo: StatefulOperatorStateInfo,
keySchema: StructType,
valueSchema: StructType,
indexOrdinal: Option[Int],
sessionState: SessionState,
storeCoordinator: Option[StateStoreCoordinatorRef])(
storeUpdateFunction: (StateStore, Iterator[T]) => Iterator[U]): StateStoreRDD[T, U]
// Used for testing only
mapPartitionsWithStateStore[U](
sqlContext: SQLContext,
stateInfo: StatefulOperatorStateInfo,
keySchema: StructType,
valueSchema: StructType,
indexOrdinal: Option[Int])(
storeUpdateFunction: (StateStore, Iterator[T]) => Iterator[U]): StateStoreRDD[T, U] (1)
mapPartitionsWithStateStore registers a task completion listener to abort the StateStore if the
state updates had not been committed before a task finished (which is to make sure that the
StateStore has been committed or aborted in the end to follow the contract of
StateStore ).
StreamingDeduplicateExec
StreamingGlobalLimitExec
StreamingAggregationStateManager Contract
— State Managers for Streaming Aggregation
StreamingAggregationStateManager is the abstraction of state managers that act as
middlemen between state stores and the physical operators used in Streaming Aggregation
(e.g. StateStoreSaveExec and StateStoreRestoreExec).
commit(
store: StateStore): Long

get
Looks up the value of the key from the state store (the key is non- null )
Used exclusively when StateStoreRestoreExec physical operator is executed.

getKey
Extracts the columns for the key from the input row
Used when:
StateStoreRestoreExec physical operator is executed
StreamingAggregationStateManagerImplV1 legacy state manager is requested to put a row to a state store

getStateValueSchema: StructType

iterator(
store: StateStore): Iterator[UnsafeRowPair]

put(
store: StateStore,
row: UnsafeRow): Unit

put
Stores (puts) the given row in the given state store
Used exclusively when StateStoreSaveExec physical operator is executed.

remove(
store: StateStore,
key: UnsafeRow): Unit

remove
Removes the key-value pair for the given key from the given state store

values(
store: StateStore): Iterator[UnsafeRow]

values
All values in the state store
createStateManager(
keyExpressions: Seq[Attribute],
inputRowAttributes: Seq[Attribute],
stateFormatVersion: Int): StreamingAggregationStateManager
createStateManager creates a StreamingAggregationStateManager for the given stateFormatVersion :
StreamingAggregationStateManagerImplV1 for 1
StreamingAggregationStateManagerImplV2 for 2 (default)
StreamingAggregationStateManagerBaseImpl
— Base State Manager for Streaming
Aggregation
StreamingAggregationStateManagerBaseImpl is the base implementation of the
StreamingAggregationStateManager contract.
Table 1. StreamingAggregationStateManagerBaseImpls
StreamingAggregationStateManagerImplV1
Legacy StreamingAggregationStateManager (when the
spark.sql.streaming.aggregation.stateFormatVersion configuration property is 1 )
StreamingAggregationStateManagerImplV2
Default StreamingAggregationStateManager (when the
spark.sql.streaming.aggregation.stateFormatVersion configuration property is 2 )
commit(
store: StateStore): Long
remove …FIXME
getKey Method
getKey …FIXME
keys …FIXME
StreamingAggregationStateManagerImplV1 —
Legacy State Manager for Streaming
Aggregation
StreamingAggregationStateManagerImplV1 is the legacy state manager for streaming
aggregations.
StreamingAggregationStateManager.
put …FIXME
Creating StreamingAggregationStateManagerImplV1
Instance
StreamingAggregationStateManagerImplV1 takes the following when created:
StreamingAggregationStateManagerImplV2 —
Default State Manager for Streaming
Aggregation
StreamingAggregationStateManagerImplV2 is the default state manager for streaming
aggregations.
StreamingAggregationStateManager.
put …FIXME
get requests the given StateStore for the current state value for the given key.
get returns null if the key could not be found in the state store. Otherwise, get
restores the original row (using restoreOriginalRow).
restoreOriginalRow …FIXME
getStateValueSchema Method
getStateValueSchema: StructType
iterator Method
iterator simply requests the input state store for the iterator that is mapped to an iterator
of UnsafeRowPairs with the key (of the input UnsafeRowPair ) and the value as a restored
original row.
values Method
values …FIXME
Internal Properties
Name Description
joiner
keyValueJoinedExpressions
needToProjectToRestoreValue
restoreValueProjector
valueExpressions
valueProjector
Stateful Stream Processing
In Spark Structured Streaming, a streaming query is stateful when it is one of the following
(and makes use of StateStores):
Streaming Aggregation
Stream-Stream Join
Streaming Deduplication
Streaming Limit
State stores are checkpointed incrementally to avoid state loss and for increased
performance.
State store providers manage versioned state per stateful operator (and partition it operates
on).
The lifecycle of a StateStoreProvider begins when StateStore utility (on a Spark executor)
is requested for the StateStore by provider ID and version.
When requested for a StateStore, StateStore utility is given the version of a state store to
look up. The version is either the current epoch (in Continuous Stream Processing) or the
current batch ID (in Micro-Batch Stream Processing).
The StateStore utility then requests the StateStoreProvider for the state store for the
requested version.
IncrementalExecution — QueryExecution of Streaming
Queries
Regardless of the query language (Dataset API or SQL), any structured query (incl.
streaming queries) becomes a logical query plan.
While planning a streaming query for execution (aka query planning), IncrementalExecution
uses the state preparation rule. The rule fills out the following physical operators with the
execution-specific configuration (with StatefulOperatorStateInfo being the most important for
stateful stream processing):
FlatMapGroupsWithStateExec
StateStoreRestoreExec
StateStoreSaveExec
StreamingDeduplicateExec
StreamingGlobalLimitExec
StreamingSymmetricHashJoinExec
The shouldRunAnotherBatch flag of stateful physical operators indicates whether the last
batch execution requires another non-data batch.
The following table shows the StateStoreWriters that redefine the shouldRunAnotherBatch flag.
StateStoreRDD
Right after query planning, a stateful streaming query (a single micro-batch actually)
becomes an RDD with one or more StateStoreRDDs.
You can find the StateStoreRDDs of a streaming query in the RDD lineage.
scala> streamingQuery.explain
== Physical Plan ==
*(4) HashAggregate(keys=[window#13-T0ms, value#3L], functions=[count(1)])
+- StateStoreSave [window#13-T0ms, value#3L], state info [ checkpoint = file:/tmp/chec
kpoint-counts/state, runId = 1dec2d81-f2d0-45b9-8f16-39ede66e13e7, opId = 0, ver = 1,
numPartitions = 1], Append, 10000, 2
+- *(3) HashAggregate(keys=[window#13-T0ms, value#3L], functions=[merge_count(1)])
+- StateStoreRestore [window#13-T0ms, value#3L], state info [ checkpoint = file:
/tmp/checkpoint-counts/state, runId = 1dec2d81-f2d0-45b9-8f16-39ede66e13e7, opId = 0,
ver = 1, numPartitions = 1], 2
+- *(2) HashAggregate(keys=[window#13-T0ms, value#3L], functions=[merge_count(
1)])
+- Exchange hashpartitioning(window#13-T0ms, value#3L, 1)
+- *(1) HashAggregate(keys=[window#13-T0ms, value#3L], functions=[parti
al_count(1)])
scala> :type se
org.apache.spark.sql.execution.streaming.StreamExecution
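The se value above is the StreamExecution behind the streaming query. A hedged sketch of how
to get at it and print the RDD lineage of the last micro-batch (assuming the query was started as
streamingQuery; StreamingQueryWrapper is an internal class, so this is for exploration only):

import org.apache.spark.sql.execution.streaming.{StreamExecution, StreamingQueryWrapper}
val se: StreamExecution = streamingQuery.asInstanceOf[StreamingQueryWrapper].streamingQuery
// the lineage of the last micro-batch shows the StateStoreRDDs
println(se.lastExecution.toRdd.toDebugString)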
When planned for execution, the StateStoreRDD is first asked for the preferred locations of a
partition (which happens on the driver) that are later used to compute it (on Spark
executors).
Spark Structured Streaming uses RPC environment to keep track of StateStores (their
StateStoreProvider actually) for RDD planning.
State Management
The state in a stateful streaming query can be implicit or explicit.
Streaming Watermark
Streaming Watermark of a stateful streaming query is how long to wait for late and possibly
out-of-order events until a streaming state can be considered final and will no longer change.
Streaming watermark is used to mark events (modeled as rows in the streaming Dataset) that
are older than the threshold as "too late", and hence not "interesting" to update partial non-final
streaming state.
withWatermark(
eventTime: String,
delayThreshold: String): Dataset[T]
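For example, a minimal sketch (the rate source, the 10-minute delay and the 5-minute window
are illustrative):

import org.apache.spark.sql.functions.window
import spark.implicits._
val windowedCounts = spark
  .readStream
  .format("rate")
  .load
  .withWatermark("timestamp", "10 minutes")   // event-time column and watermark delay
  .groupBy(window($"timestamp", "5 minutes"))
  .count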
Watermark Delay says how late and possibly out-of-order events are still acceptable and
contribute to the final result of a stateful streaming query.
Event-Time Watermark is then a time threshold (a point in time) computed by subtracting the
watermark delay from the maximum event time observed so far. It is the minimum acceptable
event time of a row (in the streaming Dataset) that is still accepted into a stateful streaming query.
With streaming watermark, memory usage of a streaming state can be controlled as late
events can easily be dropped, and old state (e.g. aggregates or join) that are never going to
be updated removed. That avoids unbounded streaming state that would inevitably use up
all the available memory of long-running streaming queries and end up in out of memory
errors.
In Append output mode the current event-time streaming watermark is used for the
following:
Output saved state rows that became expired (Expired events in the demo)
Dropping late events, i.e. don’t save them to a state store or include in aggregation
(Late events in the demo)
Streaming Aggregation
In streaming aggregation, a streaming watermark has to be defined on one or many
grouping expressions of a streaming aggregation (directly or using window standard
function).
Streaming Join
In streaming join, a streaming watermark can be defined on join keys or any of the join
sides.
Demos
Use the following demos to learn more:
Internals
Under the covers, Dataset.withWatermark high-level operator creates a logical query plan
with EventTimeWatermark logical operator.
EventTimeWatermark logical operator is planned as EventTimeWatermarkExec physical
operator that extracts the event times (from the data being processed) and adds them to an
accumulator.
Since the execution (data processing) happens on Spark executors, using the accumulator
is the only Spark-approved way for communication between the tasks (on the executors)
and the driver. The accumulator is what updates the driver with the current event-time watermark.
During the query planning phase (in MicroBatchExecution and ContinuousExecution) that
also happens on the driver, IncrementalExecution is given the current OffsetSeqMetadata
with the current event-time watermark.
Streaming Deduplication
Streaming Deduplication is…FIXME
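A minimal sketch of streaming deduplication (assuming an events streaming Dataset with id and
timestamp columns; the watermark bounds the deduplication state):

val deduped = events
  .withWatermark("timestamp", "10 minutes")
  .dropDuplicates("id", "timestamp")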
Streaming Limit
Streaming Limit is…FIXME
StateStore
StateStore supports incremental checkpointing in which only the key-value "Row" pairs
that changed are committed or aborted (without touching other key-value pairs).
StateStore is identified with the aggregating operator id and the partition id (among other
properties).
abort(): Unit

commit(): Long

commit
Used when:
FlatMapGroupsWithStateExec, StreamingDeduplicateExec and StreamingGlobalLimitExec
physical operators are executed (right after all rows in a partition have been processed)
StreamingAggregationStateManagerBaseImpl is requested to commit (changes to) a state
store (exclusively when StateStoreSaveExec physical operator is executed)
StateStoreHandler (of SymmetricHashJoinStateManager) is requested to commit changes
to a state store

get
Used when:
StreamingDeduplicateExec and StreamingGlobalLimitExec physical operators are executed
StateManagerImplBase (of FlatMapGroupsWithStateExecHelper ) is requested to getState
StreamingAggregationStateManagerImplV1 and StreamingAggregationStateManagerImplV2
are requested to get the value of a non-null key
KeyToNumValuesStore is requested to get

getRange(
start: Option[UnsafeRow],
end: Option[UnsafeRow]): Iterator[UnsafeRowPair]

getRange
Used when:
WatermarkSupport is requested to removeKeysOlderThanWatermark
StateManagerImplBase is requested to getAllState
StreamingAggregationStateManagerBaseImpl is requested to keys
KeyToNumValuesStore and KeyWithIndexToValueStore are requested to iterator
hasCommitted: Boolean

hasCommitted
Used when:
RDD (via StateStoreOps implicit class) is requested to mapPartitionsWithStateStore
(and a task finishes and may need to abort state updates)
SymmetricHashJoinStateManager is requested to abortIfNeeded (when a task finishes and
may need to abort state updates)
id: StateStoreId
iterator(): Iterator[UnsafeRowPair]
metrics: StateStoreMetrics
metrics
Used when:
StateStoreWriter stateful physical operator is requested to setStoreMetrics
StateStoreHandler (of SymmetricHashJoinStateManager) is requested to commit and for the metrics

put(
key: UnsafeRow,
value: UnsafeRow): Unit

put
Used when:
StreamingDeduplicateExec and StreamingGlobalLimitExec physical operators are executed
StateManagerImplBase is requested to putState
StreamingAggregationStateManagerImplV1 and StreamingAggregationStateManagerImplV2
are requested to store a row in a state store

remove
Used when:
Physical operators with WatermarkSupport are requested to removeKeysOlderThanWatermark
StreamingAggregationStateManagerBaseImpl is requested to remove a key from a state store
KeyToNumValuesStore is requested to remove a key
version: Long
Refer to Logging.
coordinatorRef: Option[StateStoreCoordinatorRef]
coordinatorRef requests the SparkEnv helper object for the current SparkEnv .
If the SparkEnv is available and the _coordRef is not assigned yet, coordinatorRef prints
out the following DEBUG message to the logs followed by requesting the
StateStoreCoordinatorRef for the StateStoreCoordinator endpoint.
Getting StateStoreCoordinatorRef
If the SparkEnv is available, coordinatorRef prints out the following INFO message to the
logs:
unload …FIXME
stop(): Unit
stop …FIXME
reportActiveStoreInstance(
storeProviderId: StateStoreProviderId): Unit
reportActiveStoreInstance takes the current host and executorId (from the BlockManager)
and requests the StateStoreCoordinatorRef to reportActiveInstance.
In the end, reportActiveStoreInstance prints out the following INFO message to the logs:
StateStoreProviders.
get(
storeProviderId: StateStoreProviderId,
keySchema: StructType,
valueSchema: StructType,
indexOrdinal: Option[Int],
version: Long,
storeConf: StateStoreConf,
hadoopConf: Configuration): StateStore
Note: The version is either the current epoch (in Continuous Stream Processing) or
the current batch ID (in Micro-Batch Stream Processing).
get will also start the periodic maintenance task (unless already started) and announce the
new StateStoreProvider.
In the end, get requests the StateStoreProvider to look up the StateStore by the specified
version.
startMaintenanceIfNeeded(): Unit
doMaintenance(): Unit
Doing maintenance
Unloaded [provider]
verifyIfStoreInstanceActive …FIXME
Internal Properties
Name Description
StateStoreId
StateStoreId is used to create a StateStoreProviderId (together with the run ID of a
streaming query) that is then used for the preferred locations of a partition of a
StateStoreAwareZipPartitionsRDD (executed on the driver) and to…FIXME
The name of the default state store (for reading state store data that was generated before
store names were used, i.e. in Spark 2.2 and earlier) is default.
storeCheckpointLocation(): Path
storeCheckpointLocation is Hadoop DFS’s Path of the checkpoint location (for the stateful
operator by operator ID, the partition by the partition ID in the checkpoint root location).
If the default store name is used (for Spark 2.2 and earlier), the storeName is not included in
the path.
HDFSBackedStateStore — State Store on
HDFS-Compatible File System
HDFSBackedStateStore is a concrete StateStore that uses a Hadoop DFS-compatible file
system for versioned state persistence.
HDFSBackedStateStore is created exclusively when HDFSBackedStateStoreProvider is
requested for the specified version of state (store) for update (when StateStore utility is
requested to look up a StateStore by provider id).
Version

state: STATE

state is the current state of HDFSBackedStateStore and can be in one of the three possible
states: UPDATING (default), COMMITTED, and ABORTED.
State changes (to the internal mapToUpdate registry) are allowed as long as
HDFSBackedStateStore is in the default UPDATING state. Right after a HDFSBackedStateStore
transitions to either COMMITTED or ABORTED state, no further state changes are allowed.
Note: Don't get confused with the term "state" as there are two states: the internal
state of HDFSBackedStateStore and the state of a streaming query (that
HDFSBackedStateStore is responsible for).

COMMITTED
After commit. The hasCommitted flag indicates whether HDFSBackedStateStore
is in this state or not.
writeUpdateToDeltaFile(
output: DataOutputStream,
key: UnsafeRow,
value: UnsafeRow): Unit
Caution FIXME
put Method
put(
key: UnsafeRow,
value: UnsafeRow): Unit
put stores the copies of the key and value in the mapToUpdate internal registry followed by
writing them to the delta file (using writeUpdateToDeltaFile).
commit(): Long

commit requests the parent HDFSBackedStateStoreProvider to commit the state updates (as a
new version of state) (with the newVersion, the mapToUpdate and the compressed stream).
abort(): Unit
Note abort is part of the StateStore Contract to abort the state changes.
abort …FIXME
metrics: StateStoreMetrics
The performance metrics of the provider used are only the ones listed in
supportedCustomMetrics.
Memory used (in bytes) as the memoryUsedBytes metric (of the parent provider)
hasCommitted: Boolean

hasCommitted is true when HDFSBackedStateStore is in the COMMITTED state and
false otherwise.
Internal Properties
Name Description
compressedStream: DataOutputStream
compressedStream
finalDeltaFile: Path
finalDeltaFile
newVersion: Long
newVersion
Used exclusively when HDFSBackedStateStore is requested for the
finalDeltaFile, to commit and abort
StateStoreProvider
close(): Unit
close
Closes the state store provider
Used exclusively when StateStore helper object is
requested to unload a state store provider
doMaintenance(): Unit = {}
getStore(
version: Long): StateStore
getStore
Finds the StateStore for the specified version
init(
stateStoreId: StateStoreId,
keySchema: StructType,
valueSchema: StructType,
keyIndexOrdinal: Option[Int],
storeConfs: StateStoreConf,
hadoopConf: Configuration): Unit
stateStoreId: StateStoreId
supportedCustomMetrics: Seq[StateStoreCustomMetric]
createAndInit(
stateStoreId: StateStoreId,
keySchema: StructType,
valueSchema: StructType,
indexOrdinal: Option[Int],
storeConf: StateStoreConf,
hadoopConf: Configuration): StateStoreProvider
StateStoreProviderId — Unique Identifier of
State Store Provider
StateStoreProviderId is a unique identifier of a state store provider with the following
properties:
StateStoreId
In other words, StateStoreProviderId is a StateStoreId with the run ID that is different every
restart.
StateStoreCoordinator to track the executors of state store providers (on the driver)
compute a partition
apply(
stateInfo: StatefulOperatorStateInfo,
partitionIndex: Int,
storeName: String): StateStoreProviderId
Internally, apply requests the StatefulOperatorStateInfo for the checkpoint directory (aka
checkpointLocation) and the stateful operator ID and creates a new StateStoreId (with the
partitionIndex and storeName ).
In the end, apply requests the StatefulOperatorStateInfo for the run ID of a streaming
query and creates a new StateStoreProviderId (together with the run ID).
HDFSBackedStateStoreProvider — Hadoop
DFS-based StateStoreProvider
HDFSBackedStateStoreProvider is a StateStoreProvider that uses a Hadoop DFS-compatible
Refer to Logging.
Performance Metrics
estimated size of state only on current version
Estimated size of the current state (of the HDFSBackedStateStore)
baseDir: Path

baseDir is the base directory (as Hadoop DFS's Path) for state checkpointing (for delta and
snapshot state files). It is created when HDFSBackedStateStoreProvider is initialized (if not
created already).
baseDir is initialized based on the state checkpoint location (storeCheckpointLocation) of the
StateStoreId (of the StateStoreProvider contract).
toString: String
getStore(
version: Long): StateStore

getStore loads the specified version of state (from the internal cache or snapshot and delta
files) for versions greater than 0 .
In the end, getStore creates a new HDFSBackedStateStore for the specified version with
the new state and prints out the following INFO message to the logs:
getStore throws an IllegalArgumentException when the specified version is incorrect
(negative):
deltaFile simply returns the Hadoop Path of the [version].delta file in the state
snapshotFile simply returns the Hadoop Path of the [version].snapshot file in the state
fetchFiles(): Seq[StoreFile]
fetchFiles requests the CheckpointFileManager for all the files in the state checkpoint
directory.
For every file, fetchFiles splits the name into two parts with . (dot) as a separator (files
with more or less than two parts are simply ignored) and registers a new StoreFile for
snapshot and delta files:
For snapshot files, fetchFiles creates a new StoreFile with isSnapshot flag on
( true )
For delta files, fetchFiles creates a new StoreFile with isSnapshot flag off
( false )
Note delta files are only registered if there was no snapshot file for the version.
fetchFiles prints out the following WARN message to the logs for any other files:
In the end, fetchFiles sorts the StoreFiles based on their version, prints out the following
DEBUG message to the logs, and returns the files.
init(
stateStoreId: StateStoreId,
keySchema: StructType,
valueSchema: StructType,
indexOrdinal: Option[Int],
storeConf: StateStoreConf,
hadoopConf: Configuration): Unit
init records the values of the input arguments as the stateStoreId, keySchema,
valueSchema and the other internal properties.
In the end, init requests the CheckpointFileManager to create the baseDir directory (with
parent directories).
filesForVersion(
allFiles: Seq[StoreFile],
version: Long): Seq[StoreFile]
filesForVersion finds the latest snapshot version among the given allFiles files up to
and including the given version (it may or may not be available).
If a snapshot file was found (among the given file up to and including the given version),
filesForVersion takes all delta files between the version of the snapshot file (exclusive)
and the given version (inclusive) from the given allFiles files.
The number of delta files should be the given version minus the snapshot
Note
version.
If a snapshot file was not found, filesForVersion takes all delta files up to the given version
(inclusive) from the given allFiles files.
In the end, filesForVersion returns a snapshot version (if available) and all delta files up to
the given version (inclusive).
doMaintenance(): Unit
doMaintenance simply does state snapshoting followed by cleaning up (removing old state
files).
In case of any non-fatal errors, doMaintenance simply prints out the following WARN
message to the logs:
doSnapshot(): Unit
doSnapshot lists all delta and snapshot files in the state checkpoint directory ( files ) and
doSnapshot returns immediately (and does nothing) when there are no delta and snapshot
files.
doSnapshot finds the snapshot file and delta files for the version (among the files and for the
last version).
When the last version was found in the cache and the number of delta files is above
spark.sql.streaming.stateStore.minDeltasForSnapshot internal threshold, doSnapshot writes
a compressed snapshot file for the last version.
In the end, doSnapshot prints out the following DEBUG message to the logs:
In case of non-fatal errors, doSnapshot simply prints out the following WARN message to
the logs:
cleanup(): Unit
cleanup lists all delta and snapshot files in the state checkpoint directory ( files ) and
cleanup returns immediately (and does nothing) when there are no delta and snapshot
files.
cleanup takes the version of the latest state file ( lastVersion ) and decrements it by
cleanup requests the CheckpointFileManager to delete the path of every old state file.
In the end, cleanup prints out the following INFO message to the logs:
In case of a non-fatal exception, cleanup prints out the following WARN message to the
logs:
close(): Unit
close …FIXME
getMetricsForProvider Method
memoryUsedBytes
metricLoadedMapCacheHit
metricLoadedMapCacheMiss
Supported StateStoreCustomMetrics
— supportedCustomMetrics Method
supportedCustomMetrics: Seq[StateStoreCustomMetric]
metricStateOnCurrentVersionSizeBytes
metricLoadedMapCacheHit
metricLoadedMapCacheMiss
commitUpdates(
newVersion: Long,
map: ConcurrentHashMap[UnsafeRow, UnsafeRow],
output: DataOutputStream): Unit
new version of state (with the given newVersion and the map state).
loadMap(
version: Long): ConcurrentHashMap[UnsafeRow, UnsafeRow]
loadMap firstly tries to find the state version in the loadedMaps internal cache and, if found,
If the requested state version could not be found in the loadedMaps internal cache, loadMap
prints out the following WARN message to the logs:
loadMap tries to load the state snapshot file for the version and, if found, puts the version of
If not found, loadMap tries to find the most recent state version by decrementing the
requested version until one is found in the loadedMaps internal cache or loaded from a state
snapshot (file).
loadMap updateFromDeltaFile for all the remaining versions (from the snapshot version up
to the requested one). loadMap puts the final version of state in the internal cache (the
closest snapshot and the remaining delta versions) and returns it.
In the end, loadMap prints out the following DEBUG message to the logs:
readSnapshotFile(
version: Long): Option[ConcurrentHashMap[UnsafeRow, UnsafeRow]]
readSnapshotFile creates the path of the snapshot file for the given version .
readSnapshotFile requests the CheckpointFileManager to open the snapshot file for reading
readSnapshotFile reads the decompressed input stream until an EOF (that is marked as the
integer -1 in the stream) and inserts key and value rows in a state map
( ConcurrentHashMap[UnsafeRow, UnsafeRow] ):
First integer is the size of a key (buffer) followed by the key itself (of the size).
readSnapshotFile creates an UnsafeRow for the key (with the number of fields as
Next integer is the size of a value (buffer) followed by the value itself (of the size).
readSnapshotFile creates an UnsafeRow for the value (with the number of fields as
In the end, readSnapshotFile prints out the following INFO message to the logs and returns
the key-value map.
Error reading snapshot file [fileToRead] of [this]: [key|value] size cannot be [keySiz
e|valueSize]
updateFromDeltaFile(
version: Long,
map: ConcurrentHashMap[UnsafeRow, UnsafeRow]): Unit
updateFromDeltaFile creates the path of the delta file for the requested version .
updateFromDeltaFile requests the CheckpointFileManager to open the delta file for reading
updateFromDeltaFile reads the decompressed input stream until an EOF (that is marked as
the integer -1 in the stream) and inserts key and value rows in the given state map:
First integer is the size of a key (buffer) followed by the key itself (of the size).
updateFromDeltaFile creates an UnsafeRow for the key (with the number of fields as
Next integer is the size of a value (buffer) followed by the value itself (of the size).
updateFromDeltaFile creates an UnsafeRow for the value (with the number of fields as
indicated by the number of fields of the value schema) or removes the corresponding
key from the state map (if the value size is -1 )
updateFromDeltaFile removes the key-value entry from the state map if the
Note
value (size) is -1 .
In the end, updateFromDeltaFile prints out the following INFO message to the logs and
returns the key-value map.
Error reading delta file [fileToRead] of [this]: [fileToRead] does not exist
putStateIntoStateCacheMap(
newVersion: Long,
map: ConcurrentHashMap[UnsafeRow, UnsafeRow]): Unit
putStateIntoStateCacheMap registers state for a given version, i.e. adds the map state under
and returns.
It does not add the given state when the version of the oldest state retained is later (larger)
than the given newVersion
It removes the oldest state when older (smaller) than the given newVersion
writeSnapshotFile(
version: Long,
map: ConcurrentHashMap[UnsafeRow, UnsafeRow]): Unit
For every key-value UnsafeRow pair in the given map, writeSnapshotFile writes the size of
the key followed by the key itself (as bytes). writeSnapshotFile then writes the size of the
value followed by the value itself (as bytes).
In the end, writeSnapshotFile prints out the following INFO message to the logs:
compressStream(
outputStream: DataOutputStream): DataOutputStream
cancelDeltaFile(
compressedStream: DataOutputStream,
rawStream: CancellableFSDataOutputStream): Unit
cancelDeltaFile …FIXME
finalizeDeltaFile(
output: DataOutputStream): Unit
finalizeDeltaFile simply writes -1 to the given DataOutputStream (to indicate end of file)
loadedMaps: TreeMap[
Long, // version
ConcurrentHashMap[UnsafeRow, UnsafeRow]] // state (as keys and values)
A new entry (a version and the state updates) can only be added when
HDFSBackedStateStoreProvider is requested to putStateIntoStateCacheMap (and only when
specified version of state (from the internal cache or snapshot and delta files). Positive hits
(when a version could be found in the cache) is available as the count of cache hit on states
cache in provider performance metric while misses are counted in the count of cache miss
on states cache in provider performance metric.
The state deltas (the values) in loadedMaps are cleared (all entries removed) when
HDFSBackedStateStoreProvider is requested to close.
Loading the specified version of state (from the internal cache or snapshot and delta
files)
Internal Properties
Name Description
keySchema: StructType
keySchema
valueSchema: StructType
valueSchema
numberOfVersionsToRetainInMemory: Int
sparkConf SparkConf
StateStoreCoordinator
executor ID).
GetLocation You should see the following DEBUG message in the logs:
ReportActiveInstance
StateStoreCoordinator stopped
Refer to Logging.
StateStoreCoordinatorRef — RPC Endpoint
Reference to StateStoreCoordinator
StateStoreCoordinatorRef is used to (let the tasks on Spark executors) send messages to
the StateStoreCoordinator RPC endpoint (on the driver).
getLocation(
stateStoreProviderId: StateStoreProviderId): Option[
String]
reportActiveInstance(
stateStoreProviderId: StateStoreProviderId,
host: String,
executorId: String): Unit
stop(): Unit
stop
Requests the RpcEndpointRef to send a StopCoordinator
synchronous message
Used exclusively for unit testing
verifyIfInstanceActive(
stateStoreProviderId: StateStoreProviderId,
executorId: String): Boolean
Creating StateStoreCoordinatorRef to
StateStoreCoordinator RPC Endpoint for Driver
— forDriver Factory Method
forDriver …FIXME
Creating StateStoreCoordinatorRef to
StateStoreCoordinator RPC Endpoint for Executor
— forExecutor Factory Method
forExecutor …FIXME
WatermarkSupport
watermarkExpression
Optional Catalyst expression that matches rows older than the event-time watermark.
The watermark attribute may be of type StructType . If it is, watermarkExpression uses the
first field as the watermark.
Tip: Enable INFO logging level for one of the stateful physical operators to
see the INFO message in the logs.

watermarkPredicateForData
Optional Predicate that uses watermarkExpression and the child output to
match rows older than the event-time watermark

watermarkPredicateForKeys
Optional Predicate that uses keyExpressions to match rows older than the
event-time watermark
WatermarkSupport Contract
package org.apache.spark.sql.execution.streaming
removeKeysOlderThanWatermark Method
removeKeysOlderThanWatermark(
storeManager: StreamingAggregationStateManager,
store: StateStore): Unit
removeKeysOlderThanWatermark …FIXME
StatefulOperator Contract — Physical
Operators That Read or Write to StateStore
StatefulOperator is the base of physical operators that read or write state (described by
stateInfo).
stateInfo: Option[StatefulOperatorStateInfo]
stateInfo
StateStoreReader
StateStoreReader
StateStoreReader is…FIXME
StateStoreWriter
StateStoreWriter is the extension of the StatefulOperator contract for physical operators
that write to a state store and collect the write metrics for execution progress reporting.
Table 1. StateStoreWriters
StateStoreWriter Description
FlatMapGroupsWithStateExec
StateStoreSaveExec
StreamingDeduplicateExec
StreamingGlobalLimitExec
StreamingSymmetricHashJoinExec
number of updated
state rows
setStoreMetrics requests the specified StateStore for the metrics and records them with this
physical operator.
Note: setStoreMetrics is used when the following physical operators are executed:
FlatMapGroupsWithStateExec
StateStoreSaveExec
StreamingDeduplicateExec
StreamingGlobalLimitExec
getProgress Method
getProgress(): StateOperatorProgress
getProgress …FIXME
shouldRunAnotherBatch is negative ( false ) by default (to indicate that another non-data
batch is not required given the OffsetSeqMetadata with the event-time watermark and the
batch timestamp).
stateStoreCustomMetrics …FIXME
timeTakenMs Method
timeTakenMs …FIXME
StatefulOperatorStateInfo
StatefulOperatorStateInfo identifies the state store for a given stateful physical operator:
Number of partitions (among other properties)
StatefulOperatorStateInfo is created exclusively when IncrementalExecution is requested
for nextStatefulOperationStateInfo.
StateStoreMetrics
StateStoreMetrics holds the performance metrics of a state store:
Number of keys
StateStoreMetrics is used (and created) when the following are requested for the
performance metrics:
StateStore
StateStoreHandler
SymmetricHashJoinStateManager
StateStoreCustomMetric Contract
StateStoreCustomMetric is the abstraction of metrics that a state store may wish to expose
StateStoreMetrics is created
desc: String
desc
name: String
name
Table 2. StateStoreCustomMetrics
StateStoreCustomMetric Description
StateStoreCustomSizeMetric
StateStoreCustomSumMetric
StateStoreCustomTimingMetric
StateStoreUpdater
StateStoreUpdater is…FIXME
updateStateForKeysWithData Method
Caution FIXME
updateStateForTimedOutKeys Method
Caution FIXME
EventTimeStatsAccum Accumulator — Event-
Time Column Statistics for
EventTimeWatermarkExec Physical Operator
EventTimeStatsAccum is a Spark accumulator that is used for the statistics of the event-time
column of a streaming query:
Maximum value
Minimum value
Average value
EventTimeStatsAccum is used exclusively for the EventTimeWatermarkExec
physical operator.
Note add is part of the AccumulatorV2 Contract to add (accumulate) a given value.
EventTimeStats
EventTimeStats is a Scala case class for the event-time column statistics.
EventTimeStats.add Method
add simply updates the event-time column statistics per given eventTime .
EventTimeStats.merge Method
merge …FIXME
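A simplified sketch of the event-time statistics and how add and merge could combine them
(not the actual implementation; for illustration only):

case class EventTimeStats(var max: Long, var min: Long, var avg: Double, var count: Long) {
  def add(eventTime: Long): Unit = {
    max = math.max(max, eventTime)
    min = math.min(min, eventTime)
    count += 1
    avg += (eventTime - avg) / count   // incremental average
  }
  def merge(that: EventTimeStats): Unit = {
    max = math.max(max, that.max)
    min = math.min(min, that.min)
    count += that.count
    avg = if (count == 0) 0.0 else (avg * (count - that.count) + that.avg * that.count) / count
  }
}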
StateStoreConf
StateStoreConf is…FIXME
maxVersionsToRetainInMemory
spark.sql.streaming.maxBatchesToRetainInMemory

minVersionsToRetain
spark.sql.streaming.minBatchesToRetain
Used exclusively when HDFSBackedStateStoreProvider is requested for cleanup.

providerClass
spark.sql.streaming.stateStore.providerClass
Used exclusively when StateStoreProvider helper object is requested to create and initialize the
StateStoreProvider.
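For example, the retention-related properties above could be tuned before starting a streaming
query (the values are illustrative):

// how many state versions (batches) to retain on disk and in the in-memory cache
spark.conf.set("spark.sql.streaming.minBatchesToRetain", 50)
spark.conf.set("spark.sql.streaming.maxBatchesToRetainInMemory", 2)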
Arbitrary Stateful Streaming Aggregation
Arbitrary Stateful Streaming Aggregation is a streaming aggregation query that uses the
KeyValueGroupedDataset.mapGroupsWithState or flatMapGroupsWithState high-level
operator.
Arbitrary Stateful Streaming Aggregation uses the following:
streaming aggregation state that is created separately for every aggregation key with an
aggregation state value (of a user-defined type)
aggregation state timeout that defines when a GroupState can be considered timed-out
(expired)
Demos
Use the following demos and complete applications to learn more:
FlatMapGroupsWithStateApp
Performance Metrics
Arbitrary Stateful Streaming Aggregation uses performance metrics (of the
StateStoreWriter through FlatMapGroupsWithStateExec physical operator).
Internals
One of the most important internal execution components of Arbitrary Stateful Streaming
Aggregation is FlatMapGroupsWithStateExec physical operator.
When requested to execute and generate a recipe for a distributed computation (as an
RDD[InternalRow]), FlatMapGroupsWithStateExec first validates a selected
GroupStateTimeout:
For EventTimeTimeout, event-time watermark has to be defined and the input schema
has the watermark attribute
StateStoreRDD is used to properly distribute tasks across executors (per preferred locations)
FlatMapGroupsWithStateExec physical operator uses state managers that are different than
GroupState
GroupState is the abstraction of a user-defined per-group state that is used in Arbitrary
Stateful Streaming Aggregation with the following KeyValueGroupedDataset operators:
mapGroupsWithState
flatMapGroupsWithState
exists: Boolean
exists
Checks whether the state value exists or not
If the state value does not exist, get throws a NoSuchElementException . Use
getOption instead.
get: S
get
getCurrentProcessingTimeMs(): Long
getCurrentProcessingTimeMs
Gets the current processing time (as milliseconds in
epoch time)
getCurrentWatermarkMs(): Long
getCurrentWatermarkMs
Gets the current event time watermark (as milliseconds
in epoch time)
getOption: Option[S]
getOption
Used when:
InputProcessor is requested to
callFunctionAndUpdateState (when the row iterator
is consumed and a state value has been updated,
removed or timeout changed)
GroupStateImpl is requested for the textual
representation
hasTimedOut: Boolean
hasTimedOut
Whether the state (for a given key) has timed out or not.
remove(): Unit
remove
setTimeoutDuration
Specifies the timeout duration for the state key (in
millis or as a string, e.g. "10 seconds", "1 hour") for
GroupStateTimeout.ProcessingTimeTimeout
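A minimal sketch of using GroupState with mapGroupsWithState (the Event type, the
running-count state and the timeout are illustrative; spark.implicits._ provides the encoders):

import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}
import spark.implicits._

// an illustrative event type and a per-key running count kept in GroupState[Long]
case class Event(id: String, value: Long)

def countEvents(id: String, events: Iterator[Event], state: GroupState[Long]): Long = {
  val count = state.getOption.getOrElse(0L) + events.size
  state.update(count)
  state.setTimeoutDuration("1 hour")   // requires GroupStateTimeout.ProcessingTimeTimeout
  count
}

// events is assumed to be a streaming Dataset[Event]
val counts = events
  .groupByKey(_.id)
  .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout)(countEvents)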
GroupStateImpl
GroupStateImpl is the default and only known GroupState in Spark Structured Streaming.
following:
createForStreaming
createForBatch
eventTimeWatermarkMs
GroupStateTimeout
hasTimedOut flag
watermarkPresent flag
createForStreaming[S](
optionalValue: Option[S],
batchProcessingTimeMs: Long,
eventTimeWatermarkMs: Long,
timeoutConf: GroupStateTimeout,
hasTimedOut: Boolean,
watermarkPresent: Boolean): GroupStateImpl[S]
createForStreaming simply creates a new GroupStateImpl with the given input arguments.
createForBatch(
timeoutConf: GroupStateTimeout,
watermarkPresent: Boolean): GroupStateImpl[Any]
createForBatch …FIXME
toString: String
toString …FIXME
setTimeoutDuration …FIXME
setTimeoutTimestamp …FIXME
getCurrentProcessingTimeMs(): Long
update …FIXME
remove(): Unit
remove …FIXME
Internal Properties
Name Description
FIXME
value
Used when…FIXME
FIXME
defined
Used when…FIXME
Updated flag that says whether the state has been updated
or not
Default: false
updated
Disabled ( false ) when GroupStateImpl is requested to
remove the state
Enabled ( true ) when GroupStateImpl is requested to
update the state
GroupStateTimeout
mapGroupsWithState
flatMapGroupsWithState
Table 1. GroupStateTimeouts
GroupStateTimeout Description
Timeout based on event time
EventTimeTimeout
Used when…FIXME
No timeout
NoTimeout
Used when…FIXME
ProcessingTimeTimeout
Timeout based on processing time. FlatMapGroupsWithStateExec requires batchTimestampMs
when ProcessingTimeTimeout is used.
batchTimestampMs is defined when IncrementalExecution is
created (with the state). IncrementalExecution is given
OffsetSeqMetadata when StreamExecution is requested to
run a streaming batch.
StateManager
getAllState
Retrieves all state data (for all keys) from the StateStore
Used exclusively when InputProcessor is requested to
processTimedOutState
getState(
store: StateStore,
keyRow: UnsafeRow): StateData
getState
Gets the state data for the key from the StateStore
Used exclusively when InputProcessor is requested to
processNewData
putState(
store: StateStore,
keyRow: UnsafeRow,
state: Any,
timeoutTimestamp: Long): Unit
putState
Persists (puts) the state value for the key in the StateStore
Used exclusively when InputProcessor is requested to
callFunctionAndUpdateState (right after all rows have been
processed)
removeState(
store: StateStore,
keyRow: UnsafeRow): Unit
stateSchema: StructType
State schema
Used when:
FlatMapGroupsWithStateExec physical operator is
requested to execute and generate a recipe for a
distributed computation (as an RDD[InternalRow])
StateManager is a Scala sealed trait which means that all the implementations
Note
are in the same compilation unit (a single file).
StateManagerImplV2 — Default StateManager
of FlatMapGroupsWithStateExec Physical
Operator
StateManagerImplV2 is a concrete StateManager (as a StateManagerImplBase) that is used
shouldStoreTimestamp flag
stateSchema: StructType
Note stateSchema is part of the StateManager Contract for the schema of the state.
stateSchema …FIXME
stateSerializerExprs: Seq[Expression]
stateSerializerExprs …FIXME
stateDeserializerExpr: Expression
stateDeserializerExpr …FIXME
Internal Properties
Name Description
Position of the state in a state row ( 0 )
nestedStateOrdinal
Used when…FIXME
StateManagerImplBase
StateManagerImplBase is the extension of the StateManager contract for state managers of
stateDeserializerExpr: Expression
stateDeserializerExpr
State deserializer, i.e. a Catalyst expression to
deserialize a state object from a row ( UnsafeRow )
Used exclusively for the stateDeserializerFunc
stateSerializerExprs: Seq[Expression]
stateSerializerExprs
State serializer, i.e. Catalyst expressions to serialize
a state object to a row ( UnsafeRow )
Used exclusively for the stateSerializerFunc
timeoutTimestampOrdinalInRow: Int
timeoutTimestampOrdinalInRow
Position of the timeout timestamp in a state row
Used when StateManagerImplBase is requested to get
and set timeout timestamp
Table 2. StateManagerImplBases
StateManagerImplBase Description
StateManagerImplV1 Legacy StateManager
getState(
store: StateStore,
keyRow: UnsafeRow): StateData
getState is part of the StateManager Contract to get the state data for the key
Note
from the StateStore.
getState …FIXME
putState(
store: StateStore,
key: UnsafeRow,
state: Any,
timestamp: Long): Unit
putState is part of the StateManager Contract to persist (put) the state value
Note
for the key in the StateStore.
putState …FIXME
removeState(
store: StateStore,
keyRow: UnsafeRow): Unit
removeState is part of the StateManager Contract to remove the state for the
Note
key from the StateStore.
removeState …FIXME
getAllState is part of the StateManager Contract to retrieve all state data (for
Note
all keys) from the StateStore.
getAllState …FIXME
getStateObject …FIXME
getStateRow …FIXME
getTimestamp …FIXME
setTimestamp(
stateRow: UnsafeRow,
timeoutTimestamps: Long): Unit
setTimestamp …FIXME
Internal Properties
Name Description
stateDataForGets
Empty StateData to share (reuse) between getState calls
(to avoid high use of memory with many StateData objects)
StateManagerImplV1
StateManagerImplV1 is…FIXME
FlatMapGroupsWithStateExecHelper
FlatMapGroupsWithStateExecHelper is a utility with the main purpose of creating a
StateManager for the FlatMapGroupsWithStateExec physical operator.
createStateManager(
stateEncoder: ExpressionEncoder[Any],
shouldStoreTimestamp: Boolean,
stateFormatVersion: Int): StateManager
createStateManager creates a StateManager for the given stateFormatVersion :
StateManagerImplV1 for 1
StateManagerImplV2 for 2 (default)
InputProcessor Helper Class of FlatMapGroupsWithStateExec Physical Operator
InputProcessor takes a single StateStore to be created. The StateStore manages the per-
group state (and is used when processing new data and timed-out state data, and in the "all
rows processed" callback).
processNewData creates a grouped iterator of (of pairs of) per-group state keys and the row
values from the given data iterator ( dataIter ) with the grouping attributes and the output
schema of the child operator (of the parent FlatMapGroupsWithStateExec physical operator).
For every per-group state key (in the grouped iterator), processNewData requests the
StateManager (of the parent FlatMapGroupsWithStateExec physical operator) to get the state
(from the StateStore) and callFunctionAndUpdateState (with the hasTimedOut flag off, i.e.
false ).
processTimedOutState(): Iterator[InternalRow]
processTimedOutState does nothing and simply returns when the timeout is
GroupStateTimeout.NoTimeout.
With timeout enabled, processTimedOutState gets the current timeout threshold per
GroupStateTimeout:
processTimedOutState then requests the StateManager for all the available state data (in the
StateStore) and takes only the state data with timeout defined and below the current timeout
threshold.
callFunctionAndUpdateState(
stateData: StateData,
valueRowIter: Iterator[InternalRow],
hasTimedOut: Boolean): Iterator[InternalRow]
callFunctionAndUpdateState creates a key object by requesting the given StateData for the
UnsafeRow of the key (keyRow) and converts it to an object (using the internal state key
converter).
callFunctionAndUpdateState creates value objects by taking every value row (from the given
valueRowIter iterator) and converts them to objects (using the internal state value
converter).
The current state value (of the given StateData ) that could possibly be null
callFunctionAndUpdateState then executes the user-defined state function (of the parent
FlatMapGroupsWithStateExec operator) on the key object, value objects, and the newly-
created GroupStateImpl .
For every output value from the user-defined state function, callFunctionAndUpdateState
updates numOutputRows performance metric and wraps the values to an internal row (using
the internal output value converter).
onIteratorCompletion: Unit
onIteratorCompletion branches off per whether the GroupStateImpl has been marked
When the GroupStateImpl has been marked removed and no timeout timestamp is
specified, onIteratorCompletion does the following:
Otherwise, when the GroupStateImpl has not been marked removed or the timeout
timestamp is specified, onIteratorCompletion checks whether the timeout timestamp has
changed by comparing the timeout timestamps of the GroupStateImpl and the given
StateData .
(only when the GroupStateImpl has been updated, removed or the timeout timestamp
changed) onIteratorCompletion does the following:
Internal Properties
Name Description
getOutputRow
The data type of the row is specified as the data type of
the output object attribute when the parent
FlatMapGroupsWithStateExec operator is created
DataStreamReader
format
Specifies the format of the data source
The format is used internally as the name (alias) of the streaming
source to use to load the data
load(): DataFrame
load(path: String): DataFrame (1)
schema
1. Uses a DDL-formatted table schema
Specifies the user-defined schema of the streaming data source
(as a StructType or DDL-formatted table schema, e.g. a INT, b
STRING )
Streaming loads datasets from a streaming source (that in the end creates a logical plan for
a streaming query).
import org.apache.spark.sql.SparkSession
val spark: SparkSession = ...
DataStreamReader supports many source formats natively and offers the interface to define
custom formats:
json
csv
parquet
text
Note: DataStreamReader assumes parquet file format by default that you can change
using the spark.sql.sources.default property.
After you have described the streaming pipeline to read datasets from an external
streaming data source, you eventually trigger the loading using format-agnostic load or
format-specific (e.g. json, csv) operators.
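For example (an illustrative JSON file source with an explicit schema; the path is made up):

import org.apache.spark.sql.types._
val schema = new StructType()
  .add("id", LongType)
  .add("name", StringType)
val people = spark
  .readStream
  .schema(schema)                     // required for file sources
  .option("maxFilesPerTrigger", 1)
  .json("/tmp/streaming-input")       // format-specific load operator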
source
(default: spark.sql.sources.default property) Source format of datasets in a streaming data source

extraOptions
(default: empty) Collection of key-value configuration options
There is support for values of String , Boolean , Long , and Double types for user
convenience; they are internally converted to String type.
Note: You can also set options in bulk using the options method. You have to do the type
conversion yourself, though.
load(): DataFrame
load(path: String): DataFrame (1)
load …FIXME
Built-in Formats
DataStreamReader can load streaming datasets from data sources of the following formats:
json
csv
parquet
text
DataStreamWriter — Writing Datasets To
Streaming Sink
DataStreamWriter is the interface to describe when and what rows of a streaming query are
written out to a streaming sink.
import org.apache.spark.sql.streaming.DataStreamWriter
import org.apache.spark.sql.Row
assert(streamingQuery.isStreaming)
foreachBatch(
function: (Dataset[T], Long) => Unit): DataStreamWriter[T]
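A minimal sketch of foreachBatch (streamingDF and the output path are illustrative):

import org.apache.spark.sql.{Dataset, Row}
streamingDF
  .writeStream
  .foreachBatch { (batch: Dataset[Row], batchId: Long) =>
    println(s"Processing batch $batchId")
    batch.write.mode("append").parquet("/tmp/demo-output")   // regular batch API per micro-batch
  }
  .start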
format Specifies the format of the data sink (aka output format)
The format is used internally as the name (alias) of the streaming
sink to use to write the data to
start(): StreamingQuery
start(path: String): StreamingQuery (1)
start
1. Explicit path (that could also be specified as an option)
trigger
Sets the Trigger for how often a streaming query should be
executed and the result saved.
import org.apache.spark.sql.streaming.Trigger
import scala.concurrent.duration._
import org.apache.spark.sql.DataFrame
val rates: DataFrame = spark.
readStream.
format("rate").
load
scala> rates.isStreaming
res1: Boolean = true
scala> rates.queryExecution.logical.isStreaming
res2: Boolean = true
Like the batch DataFrameWriter , DataStreamWriter has a direct support for many file
formats and an extension point to plug in new formats.
In the end, you start the actual continuous writing of the result of executing a Dataset to a
sink using start operator.
writer.save
Besides the above operators, there are the following to work with a Dataset as a whole.
Internally, option adds the key and value to extraOptions internal option registry.
outputMode specifies the output mode of a streaming query, i.e. what data is sent out to a
streaming sink when there is new data available in streaming data sources.
Note When not defined explicitly, outputMode defaults to Append output mode.
trigger method sets the time interval of the trigger (that executes a batch runner) for a
streaming query.
The default trigger is ProcessingTime(0L) that runs a streaming query as often as possible.
start(): StreamingQuery
start(path: String): StreamingQuery (1)

Note: Whether or not you have to specify the path option depends on the streaming sink in use.
memory
foreach
other formats
…FIXME

val q = spark.
  readStream.
  text("server-logs/*").
  writeStream.
  format("hive") // <-- hive format used as a streaming sink

scala> q.start
org.apache.spark.sql.AnalysisException: Hive data source can only be used with tables, you can not write files of Hive data source directly.;
  at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:234)
  ... 48 elided
Internally, foreach sets the streaming output format as foreach and foreachWriter as the input writer.

Internal Properties

Name                 Initial Value  Description
extraOptions
foreachBatchWriter   null           The function that is used as the batch writer in the ForeachBatchSink for foreachBatch
foreachWriter
partitioningColumns
source
trigger
OutputMode
Output mode ( OutputMode ) of a streaming query describes what data is written to a
streaming sink.
Append
Complete
Update
The output mode is specified on the writing side of a streaming query using
DataStreamWriter.outputMode method (by alias or a value of
org.apache.spark.sql.streaming.OutputMode object).
import org.apache.spark.sql.streaming.OutputMode.Update
val inputStream = spark
.readStream
.format("rate")
.load
.writeStream
.format("console")
.outputMode(Update) // <-- update output mode
.start
In streaming aggregations, a "new" row is one whose intermediate state becomes final, i.e. when new events for the grouping key can only be considered late, which is when the watermark moves past the event time of the key.

Append output mode requires that a streaming query defines an event-time watermark (using the withWatermark operator) on the event-time column that is used in aggregation (directly or using the window function).
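As a minimal sketch of this requirement (the events Dataset and its eventTime column are assumed for illustration and are not defined in this page):

// assumes spark-shell (spark.implicits._ in scope)
import org.apache.spark.sql.functions.window
import org.apache.spark.sql.streaming.OutputMode

// Append output mode for a streaming aggregation requires an event-time watermark
// on the column used for grouping (here via the window function)
val windowedCounts = events
  .withWatermark("eventTime", "10 minutes")
  .groupBy(window($"eventTime", "5 minutes"))
  .count

val sq = windowedCounts
  .writeStream
  .format("console")
  .outputMode(OutputMode.Append)
  .start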
Complete mode does not drop old aggregation state and preserves all data in the Result
Table.
For queries that are not streaming aggregations, Update is equivalent to the Append output
mode.
Trigger
Examples of ProcessingTime
Note: You specify the trigger for a streaming query using DataStreamWriter's trigger method.
import org.apache.spark.sql.streaming.Trigger
val query = spark.
readStream.
format("rate").
load.
writeStream.
format("console").
option("truncate", false).
trigger(Trigger.Once). // <-- execute once and stop
queryName("rate-once").
start
assert(query.isActive == false)
scala> println(query.lastProgress)
{
"id" : "2ae4b0a4-434f-4ca7-a523-4e859c07175b",
"runId" : "24039ce5-906c-4f90-b6e7-bbb3ec38a1f5",
"name" : "rate-once",
"timestamp" : "2017-07-04T18:39:35.998Z",
"numInputRows" : 0,
"processedRowsPerSecond" : 0.0,
"durationMs" : {
"addBatch" : 1365,
"getBatch" : 29,
"getOffset" : 0,
"queryPlanning" : 285,
"triggerExecution" : 1742,
"walCommit" : 40
},
"stateOperators" : [ ],
"sources" : [ {
"description" : "RateSource[rowsPerSecond=1, rampUpTimeSeconds=0, numPartitions=8]"
,
"startOffset" : null,
"endOffset" : 0,
"numInputRows" : 0,
"processedRowsPerSecond" : 0.0
} ],
"sink" : {
"description" : "org.apache.spark.sql.execution.streaming.ConsoleSink@7dbf277"
}
}
import org.apache.spark.sql.streaming.Trigger
case object MyTrigger extends Trigger
scala> val sq = spark
.readStream
.format("rate")
.load
.writeStream
.format("console")
.trigger(MyTrigger) // <-- use custom trigger
.queryName("rate-custom-trigger")
.start
java.lang.IllegalStateException: Unknown type of trigger: MyTrigger
  at org.apache.spark.sql.execution.streaming.MicroBatchExecution.<init>(MicroBatchExecution.scala:60)
  at org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:275)
  at org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:316)
  at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:325)
  ... 57 elided
Examples of ProcessingTime
ProcessingTime is a Trigger that assumes that milliseconds is the minimum time unit.
ProcessingTime(10)
ProcessingTime("10 milliseconds")
ProcessingTime("interval 10 milliseconds")
ProcessingTime(10.seconds)
You can also use the ProcessingTime.create factory methods with java.util.concurrent.TimeUnit instances.
ProcessingTime.create(10, TimeUnit.SECONDS)
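For comparison, the same intervals can also be expressed with the Trigger.ProcessingTime factory methods (a minimal sketch):

import java.util.concurrent.TimeUnit
import scala.concurrent.duration._
import org.apache.spark.sql.streaming.Trigger

// Equivalent ways to define a 10-second processing-time trigger
val t1 = Trigger.ProcessingTime(10000L)               // interval in milliseconds
val t2 = Trigger.ProcessingTime("10 seconds")         // interval string
val t3 = Trigger.ProcessingTime(10.seconds)           // Scala Duration
val t4 = Trigger.ProcessingTime(10, TimeUnit.SECONDS) // value with a TimeUnit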
StreamingQuery Contract
StreamingQuery is the contract of streaming queries that are executed continuously and concurrently (on a separate stream execution thread).

awaitTermination

awaitTermination(): Unit
awaitTermination(timeoutMs: Long): Boolean

Used when…FIXME

exception

exception: Option[StreamingQueryException]

The StreamingQueryException if the query has finished due to an exception

Used when…FIXME
explain(): Unit
explain(extended: Boolean): Unit
explain
Used when…FIXME
id: UUID
id
The unique identifier of the streaming query (that does
not change across restarts unlike runId)
Used when…FIXME
isActive: Boolean
Used when…FIXME
lastProgress: StreamingQueryProgress
lastProgress
The last StreamingQueryProgress of the streaming query
Used when…FIXME
name: String
name
The name of the query that is unique across all active
queries
Used when…FIXME
processAllAvailable(): Unit
recentProgress: Array[StreamingQueryProgress]
recentProgress
Collection of the recent StreamingQueryProgress
updates.
Used when…FIXME
runId: UUID
runId
The unique identifier of the current execution of the
streaming query (that is different every restart unlike id)
Used when…FIXME
sparkSession: SparkSession
sparkSession
Used when…FIXME
status: StreamingQueryStatus
Used when…FIXME
stop

stop(): Unit

A streaming query can be in one of the two states: active (started) or inactive (stopped).

There can only be a single Sink for a StreamingQuery with many Sources.
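A minimal sketch of inspecting a StreamingQuery (assumes a query sq that has already been started with DataStreamWriter.start):

import org.apache.spark.sql.streaming.StreamingQuery
val sq: StreamingQuery = ...

assert(sq.isActive)
println(sq.id)           // stable across restarts
println(sq.runId)        // changes at every restart
println(sq.status)       // current StreamingQueryStatus
println(sq.lastProgress) // most recent StreamingQueryProgress
sq.stop()                // stop the query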
Streaming Operators
crossJoin

crossJoin(
  right: Dataset[_]): DataFrame

dropDuplicates

dropDuplicates(): Dataset[T]
dropDuplicates(colNames: Seq[String]): Dataset[T]
dropDuplicates(col1: String, cols: String*): Dataset[T]

explain

explain(): Unit
explain(extended: Boolean): Unit

groupByKey

Aggregates rows by a typed grouping function (and gives a KeyValueGroupedDataset)
join

join(
  right: Dataset[_]): DataFrame
join(
  right: Dataset[_],
  joinExprs: Column): DataFrame
join(
  right: Dataset[_],
  joinExprs: Column,
  joinType: String): DataFrame
join(
  right: Dataset[_],
  usingColumns: Seq[String]): DataFrame
join(
  right: Dataset[_],
  usingColumns: Seq[String],
  joinType: String): DataFrame
join(
  right: Dataset[_],
  usingColumn: String): DataFrame

joinWith

joinWith[U](
  other: Dataset[U],
  condition: Column): Dataset[(T, U)]
joinWith[U](
  other: Dataset[U],
  condition: Column,
  joinType: String): Dataset[(T, U)]

withWatermark

withWatermark(
  eventTime: String,
  delayThreshold: String): Dataset[T]

writeStream

writeStream: DataStreamWriter[T]

Creates a DataStreamWriter for persisting the result of a streaming query to an external data system
// stream processing
// replace [operator] with the operator of your choice
rates.[operator]
// output stream
import org.apache.spark.sql.streaming.{OutputMode, Trigger}
import scala.concurrent.duration._
val sq = rates
.writeStream
.format("console")
.option("truncate", false)
.trigger(Trigger.ProcessingTime(10.seconds))
.outputMode(OutputMode.Complete)
.queryName("rate-console")
.start
// eventually...
sq.stop
dropDuplicates Operator — Streaming Deduplication
dropDuplicates(): Dataset[T]
dropDuplicates(colNames: Seq[String]): Dataset[T]
dropDuplicates(col1: String, cols: String*): Dataset[T]
dropDuplicates operator…FIXME
Note: For a streaming Dataset, dropDuplicates will keep all data across triggers as intermediate state to drop duplicate rows. You can use the withWatermark operator to limit how late the duplicate data can be and the system will accordingly limit the state. In addition, data that is too late (older than the watermark) will be dropped to avoid any possibility of duplicates.
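A minimal sketch of watermark-bounded streaming deduplication (assumes a streaming Dataset events with a timestamp column eventTime and a key column id; the names are illustrative):

// Deduplicate by id and eventTime, keeping only as much state as the watermark allows
val deduped = events
  .withWatermark("eventTime", "10 minutes")
  .dropDuplicates("id", "eventTime")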
scala> spark.version
res0: String = 2.3.0-SNAPSHOT
scala> println(ids.queryExecution.analyzed.numberedTreeString)
00 Project [cast(time#10 as bigint) AS time#15L, id#6]
01 +- Deduplicate [id#6], true
02 +- Project [cast(time#5 as timestamp) AS time#10, id#6]
03 +- Project [_1#2 AS time#5, _2#3 AS id#6]
04 +- StreamingExecutionRelation MemoryStream[_1#2,_2#3], [_1#2, _2#3]
queryName("dups").
outputMode(OutputMode.Append).
trigger(Trigger.ProcessingTime(30.seconds)).
option("checkpointLocation", "checkpoint-dir"). // <-- use checkpointing to save sta
te between restarts
start
q.processAllAvailable()
source.addData(4 -> 1)
source.addData(5 -> 2)
source.addData(6 -> 2)
scala> spark.table("dups").show
+----+---+
|time| id|
+----+---+
| 1| 1|
| 5| 2|
+----+---+
val q = ids.
  writeStream.
  format("memory").
  queryName("dups").
  outputMode(OutputMode.Complete). // <-- memory sink supports checkpointing for Complete output mode only
  trigger(Trigger.ProcessingTime(30.seconds)).
  option("checkpointLocation", "checkpoint-dir"). // <-- use checkpointing to save state between restarts
  start
// Doh! MemorySink is fine, but Complete is only available with a streaming aggregation
scala> spark.table("dups").show
+----+---+
|time| id|
+----+---+
+----+---+
source.addData(0 -> 1)
// wait till the batch is triggered
scala> spark.table("dups").show
+----+---+
|time| id|
+----+---+
| 0| 1|
+----+---+
source.addData(1 -> 1)
source.addData(2 -> 1)
// Use groupBy to pass the requirement of having a streaming aggregation for Complete output mode
val counts = ids.groupBy("id").agg(first($"time") as "first_time")
scala> counts.explain
== Physical Plan ==
*HashAggregate(keys=[id#246], functions=[first(time#255L, false)])
+- StateStoreSave [id#246], StatefulOperatorStateInfo(<unknown>,3585583b-42d7-4547-8d62
-255581c48275,0,0), Append, 0
+- *HashAggregate(keys=[id#246], functions=[merge_first(time#255L, false)])
+- StateStoreRestore [id#246], StatefulOperatorStateInfo(<unknown>,3585583b-42d7
-4547-8d62-255581c48275,0,0)
+- *HashAggregate(keys=[id#246], functions=[merge_first(time#255L, false)])
+- *HashAggregate(keys=[id#246], functions=[partial_first(time#255L, false
)])
+- *Project [cast(time#250 as bigint) AS time#255L, id#246]
+- StreamingDeduplicate [id#246], StatefulOperatorStateInfo(<unknown
>,3585583b-42d7-4547-8d62-255581c48275,1,0), 0
+- Exchange hashpartitioning(id#246, 200)
+- *Project [cast(_1#242 as timestamp) AS time#250, _2#243 AS
id#246]
+- StreamingRelation MemoryStream[_1#242,_2#243], [_1#242,
_2#243]
val q = counts.
  writeStream.
  format("memory").
  queryName("dups").
  outputMode(OutputMode.Complete). // <-- memory sink supports checkpointing for Complete output mode only
  trigger(Trigger.ProcessingTime(30.seconds)).
  option("checkpointLocation", "checkpoint-dir"). // <-- use checkpointing to save state between restarts
  start
source.addData(0 -> 1)
source.addData(1 -> 1)
// wait till the batch is triggered
scala> spark.table("dups").show
+---+----------+
| id|first_time|
+---+----------+
| 1| 0|
+---+----------+
// Publish duplicates
explain Operator
Dataset.explain is a high-level operator that prints the logical and (with the extended flag enabled) physical plans of a streaming query to the console.

== Physical Plan ==
StreamingRelation rate, [timestamp#0, value#1L]

Internally, explain creates an ExplainCommand runnable command with the logical plan and the extended flag.

explain then executes the plan with the ExplainCommand runnable command and collects the results that are printed out to the console.
groupBy Operator
groupBy operator…FIXME
scala> spark.version
res0: String = 2.3.0-SNAPSHOT
import org.apache.spark.sql.streaming._
import scala.concurrent.duration._
val sq = counts.writeStream.
  format("console").
  option("truncate", false).
  trigger(Trigger.ProcessingTime(30.seconds)).
  outputMode(OutputMode.Update). // <-- only Update or Complete acceptable because of groupBy aggregation
  start
855cec1c-25dc-4a86-ae54-c6cdd4ed02ec,0,0), Update, 0
+- ObjectHashAggregate(keys=[key#27], functions=[merge_collect_list(value#33, 0,
0), merge_collect_list(time#22-T600000ms, 0, 0)])
+- Exchange hashpartitioning(key#27, 200)
+- StateStoreRestore [key#27], StatefulOperatorStateInfo(file:/private/var
/folders/0w/kb0d3rqn4zb9fcc91pxhgn8w0000gn/T/temporary-635d6519-b6ca-4686-9b6b-5db0e83
cfd51/state,855cec1c-25dc-4a86-ae54-c6cdd4ed02ec,0,0)
+- ObjectHashAggregate(keys=[key#27], functions=[merge_collect_list(val
ue#33, 0, 0), merge_collect_list(time#22-T600000ms, 0, 0)])
+- Exchange hashpartitioning(key#27, 200)
+- ObjectHashAggregate(keys=[key#27], functions=[partial_collect_
list(value#33, 0, 0), partial_collect_list(time#22-T600000ms, 0, 0)])
+- EventTimeWatermark time#22: timestamp, interval 10 minutes
+- *Project [cast(split(cast(value#76 as string), ,)[0] as
timestamp) AS time#22, cast(split(cast(value#76 as string), ,)[1] as int) AS key#27, s
plit(cast(value#76 as string), ,)[2] AS value#33]
+- Scan ExistingRDD[key#75,value#76,topic#77,partition#78
,offset#79L,timestamp#80,timestampType#81]
groupByKey Operator — Streaming Aggregation

Introduction
groupByKey operator creates a KeyValueGroupedDataset (with keys of type K and rows of type T) to apply aggregation functions over groups of rows (of type T) by key (of type K) per the given func key-generating function.

Note: The type of the input argument of func is the type of rows in the Dataset (i.e. Dataset[T]).

groupByKey simply applies the func function to every row (of type T) and associates it with a logical group per key (of type K).

func: T => K
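A minimal sketch of groupByKey in a streaming query (the grouping key is derived from the value column of the rate source):

// assumes spark-shell (spark.implicits._ in scope)
import java.sql.Timestamp
val rateStream = spark
  .readStream
  .format("rate")
  .load
  .as[(Timestamp, Long)]

val countsPerGroup = rateStream
  .groupByKey { case (_, value) => value % 5 } // K = Long
  .count                                       // Dataset[(Long, Long)]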
Internally, groupByKey creates a structured query with the AppendColumns unary logical
operator (with the given func and the analyzed logical plan of the target Dataset that
groupByKey was executed on) and creates a new QueryExecution .
The new columns of the AppendColumns logical operator (i.e. the attributes of the key) are used as the grouping attributes of the KeyValueGroupedDataset.
scala> :type sq
org.apache.spark.sql.Dataset[Long]

// input stream
import java.sql.Timestamp
val signals = spark.
  readStream.
  format("rate").
  option("rowsPerSecond", 1).
  load.
  withColumn("value", $"value" % 10).                   // <-- randomize the values (just for fun)
  withColumn("deviceId", lit(util.Random.nextInt(10))). // <-- 10 devices randomly assigned to values
  as[(Timestamp, Long, Int)]                            // <-- convert to a "better" type (from "unpleasant" Row)
withWatermark Operator — Event-Time Watermark
withWatermark(eventTime: String, delayThreshold: String): Dataset[T]
withWatermark specifies the eventTime column for the event-time watermark and the delayThreshold for the minimum delay to wait for late data.

eventTime specifies the column to use for the watermark and can be either part of the Dataset schema or added later (e.g. using the withColumn operator).
The current watermark is computed by looking at the maximum eventTime seen across all
of the partitions in a query minus a user-specified delayThreshold . Due to the cost of
coordinating this value across partitions, the actual watermark used is only guaranteed to be
at least delayThreshold behind the actual event time.
Note: In some cases Spark may still process records that arrive more than delayThreshold late.
window Function
window(
timeColumn: Column,
windowDuration: String): Column (1)
window(
timeColumn: Column,
windowDuration: String,
slideDuration: String): Column (2)
window(
timeColumn: Column,
windowDuration: String,
slideDuration: String,
startTime: String): Column (3)
Note: Tumbling windows group elements of a stream into finite sets where each set corresponds to an interval.
Tumbling windows discretize a stream into non-overlapping windows.
// https://docs.oracle.com/javase/8/docs/api/java/time/LocalDateTime.html
import java.time.LocalDateTime
// https://docs.oracle.com/javase/8/docs/api/java/sql/Timestamp.html
import java.sql.Timestamp
val levels = Seq(
// (year, month, dayOfMonth, hour, minute, second)
((2012, 12, 12, 12, 12, 12), 5),
((2012, 12, 12, 12, 12, 14), 9),
((2012, 12, 12, 13, 13, 14), 4),
((2016, 8, 13, 0, 0, 0), 10),
((2017, 5, 27, 0, 0, 0), 15)).
map { case ((yy, mm, dd, h, m, s), a) => (LocalDateTime.of(yy, mm, dd, h, m, s), a)
}.
map { case (ts, a) => (Timestamp.valueOf(ts), a) }.
toDF("time", "level")
scala> levels.show
+-------------------+-----+
| time|level|
+-------------------+-----+
|2012-12-12 12:12:12| 5|
|2012-12-12 12:12:14| 9|
|2012-12-12 13:13:14| 4|
|2016-08-13 00:00:00| 10|
|2017-05-27 00:00:00| 15|
+-------------------+-----+
scala> q.printSchema
root
|-- window: struct (nullable = true)
| |-- start: timestamp (nullable = true)
| |-- end: timestamp (nullable = true)
|-- level: integer (nullable = false)
| start| end|level_sum|
+-------------------+-------------------+---------+
|2012-12-12 13:13:10|2012-12-12 13:13:15| 4|
|2012-12-12 12:12:10|2012-12-12 12:12:15| 14|
|2016-08-13 00:00:00|2016-08-13 00:00:05| 10|
|2017-05-27 00:00:00|2017-05-27 00:00:05| 15|
+-------------------+-------------------+---------+
windowDuration and slideDuration are strings specifying the width of the window and the slide interval, respectively.

4. The slide duration must be less than or equal to the window duration.
Internally, window creates a Column with TimeWindow Catalyst expression under window
alias.
Internally, TimeWindow Catalyst expression is simply a struct type with two fields, i.e. start
and end , both of TimestampType type.
scala> println(windowExpr.dataType)
StructType(StructField(start,TimestampType,true), StructField(end,TimestampType,true))
scala> println(windowExpr.dataType.prettyJson)
{
"type" : "struct",
"fields" : [ {
"name" : "start",
"type" : "timestamp",
"nullable" : true,
"metadata" : { }
}, {
"name" : "end",
"type" : "timestamp",
"nullable" : true,
"metadata" : { }
} ]
}
Example — Traffic Sensor
Note: The example is borrowed from Introducing Stream Windows in Apache Flink.
The example shows how to use window function to model a traffic sensor that counts every
15 seconds the number of vehicles passing a certain location.
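A minimal sketch of such a query (the vehicles Dataset with passingTime and location columns is assumed; it is not part of the original example):

// assumes spark-shell (spark.implicits._ in scope)
import org.apache.spark.sql.functions.window
val vehicleCounts = vehicles
  .groupBy(window($"passingTime", "15 seconds"), $"location")
  .count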
KeyValueGroupedDataset — Streaming Aggregation
KeyValueGroupedDataset represents a grouped dataset as a result of the Dataset.groupByKey operator.

// Dataset[T]
groupByKey(func: T => K): KeyValueGroupedDataset[K, T]
import java.sql.Timestamp
val numGroups = spark.
readStream.
format("rate").
load.
as[(Timestamp, Long)].
groupByKey { case (time, value) => value % 2 }
The values can be mapped over using the KeyValueGroupedDataset.mapValues operator.

KeyValueGroupedDataset works for batch and streaming aggregations, but shines the most in streaming aggregations.
import org.apache.spark.sql.streaming.Trigger
import scala.concurrent.duration._
numGroups.
mapGroups { case(group, values) => values.size }.
writeStream.
format("console").
trigger(Trigger.ProcessingTime(10.seconds)).
start
-------------------------------------------
Batch: 0
-------------------------------------------
+-----+
|value|
+-----+
+-----+
-------------------------------------------
Batch: 1
-------------------------------------------
+-----+
|value|
+-----+
| 3|
| 2|
+-----+
-------------------------------------------
Batch: 2
-------------------------------------------
+-----+
|value|
+-----+
| 5|
| 5|
+-----+
// Eventually...
spark.streams.active.foreach(_.stop)
cogroup

cogroup[U, R : Encoder](
  other: KeyValueGroupedDataset[K, U])(
  f: (K, Iterator[V], Iterator[U]) => TraversableOnce[R]): Dataset[R]

flatMapGroupsWithState

Arbitrary Stateful Streaming Aggregation - streaming aggregation with explicit state logic and state timeout

keys

keys: Dataset[K]

keyAs

keyAs[L : Encoder]: KeyValueGroupedDataset[L, V]
QueryExecution
Data attributes
Grouping attributes
mapGroupsWithState Operator — Stateful Streaming Aggregation (with Explicit State Logic)
mapGroupsWithState[S: Encoder, U: Encoder](
func: (K, Iterator[V], GroupState[S]) => U): Dataset[U] (1)
mapGroupsWithState[S: Encoder, U: Encoder](
timeoutConf: GroupStateTimeout)(
func: (K, Iterator[V], GroupState[S]) => U): Dataset[U]
mapGroupsWithState operator…FIXME
import org.apache.spark.sql.streaming.GroupState
def mappingFunc(key: Long, values: Iterator[(java.sql.Timestamp, Long)], state: GroupState[Long]): Long = {
println(s">>> key: $key => state: $state")
val newState = state.getOption.map(_ + values.size).getOrElse(0L)
state.update(newState)
key
}
import org.apache.spark.sql.streaming.GroupStateTimeout
val longs = numGroups.mapGroupsWithState(
timeoutConf = GroupStateTimeout.ProcessingTimeTimeout)(
func = mappingFunc)
format("console").
trigger(Trigger.ProcessingTime(10.seconds)).
outputMode(OutputMode.Update). // <-- required for mapGroupsWithState
start
// Note GroupState
-------------------------------------------
Batch: 1
-------------------------------------------
>>> key: 0 => state: GroupState(<undefined>)
>>> key: 1 => state: GroupState(<undefined>)
+-----+
|value|
+-----+
| 0|
| 1|
+-----+
-------------------------------------------
Batch: 2
-------------------------------------------
>>> key: 0 => state: GroupState(0)
>>> key: 1 => state: GroupState(0)
+-----+
|value|
+-----+
| 0|
| 1|
+-----+
-------------------------------------------
Batch: 3
-------------------------------------------
>>> key: 0 => state: GroupState(4)
>>> key: 1 => state: GroupState(4)
+-----+
|value|
+-----+
| 0|
| 1|
+-----+
// in the end
spark.streams.active.foreach(_.stop)
flatMapGroupsWithState Operator — Arbitrary Stateful Streaming Aggregation (with Explicit State Logic)
KeyValueGroupedDataset[K, V].flatMapGroupsWithState[S: Encoder, U: Encoder](
outputMode: OutputMode,
timeoutConf: GroupStateTimeout)(
func: (K, Iterator[V], GroupState[S]) => Iterator[U]): Dataset[U]
Every time the state function func is executed for a key, the state (as GroupState[S] ) is for
this key only.
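A minimal sketch of flatMapGroupsWithState (reuses the numGroups KeyValueGroupedDataset from the KeyValueGroupedDataset page; the counting logic is illustrative only):

// assumes spark-shell (spark.implicits._ in scope)
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

def countFunc(
    key: Long,
    values: Iterator[(java.sql.Timestamp, Long)],
    state: GroupState[Long]): Iterator[(Long, Long)] = {
  val newCount = state.getOption.getOrElse(0L) + values.size
  state.update(newCount) // keep the running count as the per-key state
  Iterator((key, newCount))
}

val runningCounts = numGroups.flatMapGroupsWithState(
  outputMode = OutputMode.Update,
  timeoutConf = GroupStateTimeout.NoTimeout)(func = countFunc)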
StreamingQueryManager — Streaming Query Management
StreamingQueryManager is the management interface for active streaming queries of a
SparkSession.
active: Array[StreamingQuery]
active
Active structured queries
awaitAnyTermination

awaitAnyTermination(): Unit
awaitAnyTermination(timeoutMs: Long): Boolean

Waits until any streaming query terminates or timeoutMs elapses
removeListener(
listener: StreamingQueryListener): Unit
removeListener
resetTerminated(): Unit
resetTerminated
Resets the internal registry of the terminated streaming
queries (that lets awaitAnyTermination to be used again)
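A minimal sketch of query termination management with StreamingQueryManager (assumes one or more streaming queries have already been started):

val sqm = spark.streams
println(s"Active queries: ${sqm.active.map(_.name).mkString(", ")}")
sqm.awaitAnyTermination() // blocks until any active streaming query terminates
sqm.resetTerminated()     // so awaitAnyTermination can be used again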
Figure 1. StreamingQueryManager
Tip: Read up on SessionState in The Internals of Spark SQL gitbook.
listenerBus: StreamingQueryListenerBus
active: Array[StreamingQuery]
De-Registering StreamingQueryListener — removeListener Method

awaitAnyTermination(): Unit
awaitAnyTermination(timeoutMs: Long): Boolean

resetTerminated Method

resetTerminated(): Unit

resetTerminated forgets about the past-terminated queries (so that awaitAnyTermination can be used again).
createQuery(
userSpecifiedName: Option[String],
userSpecifiedCheckpointLocation: Option[String],
df: DataFrame,
extraOptions: Map[String, String],
sink: BaseStreamingSink,
outputMode: OutputMode,
useTempCheckpointLocation: Boolean,
recoverFromCheckpointLocation: Boolean,
trigger: Trigger,
triggerClock: Clock): StreamingQueryWrapper
defined properties).
Internally, createQuery first finds the name of the checkpoint directory of a query (aka
checkpoint location) in the following order:
If the directory name for the checkpoint location could not be found, createQuery reports an AnalysisException.

createQuery also reports an AnalysisException when the recoverFromCheckpointLocation flag is turned off but there is an offsets directory in the checkpoint location.

createQuery makes sure that the logical plan of the structured query is analyzed (i.e. passes analysis with no errors).
(only when spark.sql.adaptive.enabled Spark property is turned on) createQuery prints out
a WARN message to the logs:
startQuery(
userSpecifiedName: Option[String],
userSpecifiedCheckpointLocation: Option[String],
df: DataFrame,
extraOptions: Map[String, String],
sink: BaseStreamingSink,
outputMode: OutputMode,
useTempCheckpointLocation: Boolean = false,
recoverFromCheckpointLocation: Boolean = true,
trigger: Trigger = ProcessingTime(0),
triggerClock: Clock = new SystemClock()): StreamingQuery
In the end, startQuery returns the StreamingQueryWrapper (as part of the fluent API so
you can chain operators) or throws the exception that was reported when attempting to start
the query.
Cannot start query with name [name] as a query with that name is already active
postListenerEvent simply posts the input event to the internal event bus for streaming
events (StreamingQueryListenerBus).
registry (when no earlier streaming query was recorded or the terminatedQuery terminated
due to an exception).
Internal Properties

Name                  Description
activeQueriesLock
awaitTerminationLock
SQLConf
SQLConf is an internal configuration holder of the parameters and hints used to configure a Spark Structured Streaming application (and Spark SQL applications in general).
FLATMAPGROUPSWITHSTATE_STATE_FORMAT_VERSION
spark.sql.streaming.flatMapGroupsWithState.stateFormatVersion

Used when:
  FlatMapGroupsWithStateStrategy execution planning strategy is requested to plan a streaming query (and creates a FlatMapGroupsWithStateExec physical operator for every FlatMapGroupsWithState operator)
Among the checkpointed properties

minBatchesToRetain
spark.sql.streaming.minBatchesToRetain

Used when:
  CompactibleFileStreamLog is created
  StreamExecution is created
  StateStoreConf is created

SHUFFLE_PARTITIONS
spark.sql.shuffle.partitions

See spark.sql.shuffle.partitions in The Internals of Spark SQL.

stateStoreMinDeltasForSnapshot
spark.sql.streaming.stateStore.minDeltasForSnapshot

Used (as StateStoreConf.minDeltasForSnapshot) exclusively when HDFSBackedStateStoreProvider is requested to doSnapshot

stateStoreProviderClass
spark.sql.streaming.stateStore.providerClass

Used when:
  StateStoreWriter is requested for the metrics and getProgress (stateStoreCustomMetrics)
  StateStoreConf is created

STREAMING_AGGREGATION_STATE_FORMAT_VERSION
spark.sql.streaming.aggregation.stateFormatVersion

Used when:
  StatefulAggregationStrategy execution planning strategy is executed
  OffsetSeqMetadata (for the relevantSQLConfs and relevantSQLConfDefaultValue)

STREAMING_MULTIPLE_WATERMARK_POLICY
spark.sql.streaming.multipleWatermarkPolicy

streamingNoDataProgressEventInterval
spark.sql.streaming.noDataProgressEventInterval

Used exclusively for ProgressReporter

streamingPollingDelay
spark.sql.streaming.pollingDelay

Used exclusively when StreamExecution is created
Configuration Properties
Configuration properties are used to fine-tune Spark Structured Streaming applications.
You can set them for a SparkSession when it is created using config method.
import org.apache.spark.sql.SparkSession
val spark = SparkSession
.builder
.config("spark.sql.streaming.metricsEnabled", true)
.getOrCreate
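Configuration properties can also be changed on an existing SparkSession at runtime (a minimal sketch using the same property as above):

spark.conf.set("spark.sql.streaming.metricsEnabled", true)
assert(spark.conf.get("spark.sql.streaming.metricsEnabled") == "true")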
spark.sql.streaming.checkpointFileManagerClass

(internal) CheckpointFileManager to use to write checkpoint files atomically

Default: FileContextBasedCheckpointFileManager (with FileSystemBasedCheckpointFileManager in case of unsupported file system used for metadata files)
spark.sql.streaming.fileSource.log.compactInterval

Default: 10

Must be a positive value (greater than 0)

Use SQLConf.fileSourceLogCompactInterval to get the current value.
spark.sql.streaming.metricsEnabled

Use SQLConf.streamingMetricsEnabled to get the current value
Use SQLConf.streamingNoDataMicroBatchesEnabled to get the current value

Use SQLConf.streamingNoDataProgressEventInterval to get the current value

spark.sql.streaming.numRecentProgressUpdates

Number of StreamingQueryProgresses to keep in the progressBuffer internal registry when ProgressReporter is requested to update the progress of a streaming query

Default: 100

Use SQLConf.streamingProgressRetention to get the current value
Use SQLConf.stateStoreProviderClass to get the current value.
StreamingQueryListener — Intercepting Life Cycle Events of Streaming Queries
StreamingQueryListener is the contract of listeners that want to be notified about the life cycle events of streaming queries, i.e. start, progress and termination of a streaming query.
onQueryStarted

onQueryStarted(
  event: QueryStartedEvent): Unit

Informs that DataStreamWriter was requested to start execution of the streaming query (on the stream execution thread)

onQueryProgress

onQueryProgress(
  event: QueryProgressEvent): Unit

onQueryTerminated

onQueryTerminated(
  event: QueryTerminatedEvent): Unit
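A minimal sketch of a custom StreamingQueryListener and how to register it with the StreamingQueryManager:

import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

val listener = new StreamingQueryListener {
  def onQueryStarted(event: QueryStartedEvent): Unit =
    println(s"Query started: ${event.id}")
  def onQueryProgress(event: QueryProgressEvent): Unit =
    println(s"Progress: ${event.progress.numInputRows} input rows")
  def onQueryTerminated(event: QueryTerminatedEvent): Unit =
    println(s"Query terminated: ${event.id}")
}
spark.streams.addListener(listener)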
QueryStartedEvent (id, runId, name)

Handled by onQueryStarted. Posted when StreamExecution is requested to run stream processing (when DataStreamWriter is requested to start execution of the streaming query on the stream execution thread).

QueryProgressEvent (StreamingQueryProgress)

Handled by onQueryProgress. Posted when ProgressReporter is requested to update progress of a streaming query (after MicroBatchExecution has finished the triggerExecution phase at the end of a streaming batch).
ProgressReporter Contract
ProgressReporter is the contract of stream execution progress reporters that report the progress of a streaming query.
currentBatchId: Long
currentBatchId
id: UUID
id
lastExecution: QueryExecution
lastExecution
logicalPlan: LogicalPlan
name: String
name
Used when:
ProgressReporter extracts statistics from the most
recent query execution (to calculate the so-called
inputRows )
offsetSeqMetadata: OffsetSeqMetadata
offsetSeqMetadata
OffsetSeqMetadata (with the current micro-batch event-time
watermark and timestamp)
Posts StreamingQueryListener.Event
runId: UUID
runId
Universally unique identifier (UUID) of the single run of the
streaming query (that changes every restart)
sink: BaseStreamingSink
sink
The one and only streaming writer or sink of the streaming
query
sources: Seq[BaseStreamingSource]
sources
Streaming readers and sources of the streaming query
Used when finishing a trigger (and updating progress and
marking current status as trigger inactive)
sparkSession: SparkSession
triggerClock: Clock
triggerClock
ProgressReporter uses the spark.sql.streaming.noDataProgressEventInterval configuration property to control how long to wait between two progress events when there is no data (default: 10000L) when finishing a trigger.
import org.apache.spark.sql.streaming.Trigger
import scala.concurrent.duration._
val sampleQuery = spark
.readStream
.format("rate")
.load
.writeStream
.format("console")
.option("truncate", false)
.trigger(Trigger.ProcessingTime(10.seconds))
.start
scala> println(sampleQuery.lastProgress.sources(0))
res40: org.apache.spark.sql.streaming.SourceProgress =
{
"description" : "RateSource[rowsPerSecond=1, rampUpTimeSeconds=0, numPartitions=8]",
"startOffset" : 333,
"endOffset" : 343,
"numInputRows" : 10,
"inputRowsPerSecond" : 0.9998000399920015,
"processedRowsPerSecond" : 200.0
}
// With a hack
import org.apache.spark.sql.execution.streaming.StreamingQueryWrapper
val offsets = sampleQuery.
asInstanceOf[StreamingQueryWrapper].
streamingQuery.
availableOffsets.
map { case (source, offset) =>
s"source = $source => offset = $offset" }
scala> offsets.foreach(println)
source = RateSource[rowsPerSecond=1, rampUpTimeSeconds=0, numPartitions=8] => offset =
293
progressBuffer: Queue[StreamingQueryProgress]
progressBuffer is used when ProgressReporter is requested for the last and the recent
StreamingQueryProgresses
status Method
status: StreamingQueryStatus
startTrigger(): Unit
startTrigger sets lastTriggerStartTimestamp to currentTriggerStartTimestamp, and resets currentTriggerStartOffsets and currentTriggerEndOffsets (to null).

finishTrigger computes the execution statistics using extractExecutionStats.

finishTrigger calculates the processing time (in seconds) as the difference between the end and the start timestamps of the trigger.

finishTrigger calculates the input time (in seconds) as the difference between the start timestamps of the current and the last triggers.
finishTrigger creates a SourceProgress (aka source statistics) for every source used.
If there was any data (using the input hasNewData flag), finishTrigger resets
lastNoDataProgressEventTime (i.e. becomes the minimum possible time) and updates
query progress.
Otherwise, when no data was available (using the input hasNewData flag), finishTrigger
updates query progress only when lastNoDataProgressEventTime passed.
reportTimeTaken[T](
triggerDetailKey: String)(
body: => T): T
In the end, reportTimeTaken prints out the following DEBUG message to the logs and
returns the result of executing body .
MicroBatchExecution
  triggerExecution
  getOffset
  setOffsetRange
  getEndOffset
  walCommit
  getBatch
  queryPlanning
  addBatch

ContinuousExecution
  queryPlanning
  runContinuous
extractExecutionStats extracts execution statistics of the most recent execution of the streaming query.

extractExecutionStats uses extractStateOperatorMetrics and extractSourceToNumInputRows.
extractStateOperatorMetrics(
hasNewData: Boolean): Seq[StateOperatorProgress]
extractStateOperatorMetrics requests the last QueryExecution for the optimized physical query plan (executedPlan), finds all StateStoreWriter physical operators and requests them for StateOperatorProgress.

extractStateOperatorMetrics returns an empty collection when the last QueryExecution is uninitialized (null).
extractSourceToNumInputRows …FIXME
formatTimestamp …FIXME
recordTriggerOffsets(
from: StreamProgress,
to: StreamProgress): Unit
lastProgress: StreamingQueryProgress
lastProgress …FIXME
recentProgress Method
recentProgress: Array[StreamingQueryProgress]
recentProgress …FIXME
Internal Properties

Name  Description

scala.collection.mutable.HashMap of action names (aka triggerDetailKey) and their cumulative times (in milliseconds).

Tip:
scala> query.lastProgress.durationMs
res3: java.util.Map[String,Long] = {triggerExecution=60, queryPlanning=1, getBatch=5, getOffset=0, addBatch=30, walCommit=23}
currentTriggerEndOffsets
Default: -1L
Flag to…FIXME
metricWarningLogged
Default: false
StreamingQueryProgress
StreamingQueryProgress holds information about the progress of a streaming query.
timestamp Time when the trigger has started (in ISO8601 format).
stateOperators Information about stateful operators in the query that store state.
sources
Statistics about the data read from every streaming source in a
streaming query
ExecutionStats
ExecutionStats is…FIXME
SourceProgress
SourceProgress is…FIXME
SinkProgress
SinkProgress is…FIXME
StreamingQueryStatus
StreamingQueryStatus is…FIXME
MetricsReporter
MetricsReporter is…FIXME
Web UI
Web UI…FIXME
Caution FIXME What’s visible on the plan diagram in the SQL tab of the UI
Logging
Caution FIXME
FileStreamSource
FileStreamSource is a Source that reads text files from the path directory as they appear. It uses LongOffset offsets.
You can provide the schema of the data and dataFrameBuilder - the function to build a
DataFrame in getBatch at instantiation time.
val df = spark.readStream
.format("text")
.option("maxFilesPerTrigger", 1)
.load("text-logs")
scala> df.printSchema
root
|-- value: string (nullable = true)
import org.apache.spark.sql.types._
val schema = StructType(
StructField("id", LongType, nullable = false) ::
StructField("name", StringType, nullable = false) ::
StructField("score", DoubleType, nullable = false) :: Nil)
scala> input.isStreaming
res9: Boolean = true
Refer to Logging.
Caution FIXME
Options
maxFilesPerTrigger
maxFilesPerTrigger option specifies the maximum number of files per trigger (batch). It
limits the file stream source to read the maxFilesPerTrigger number of files specified at a
time and hence enables rate limiting.
It allows for a static set of files be used like a stream for testing as the file set is processed
maxFilesPerTrigger number of files at a time.
schema
If the schema is specified at instantiation time (using optional dataSchema constructor
parameter) it is returned.
Otherwise, fetchAllFiles internal method is called to list all the files in a directory.
When there is at least one file the schema is calculated using dataFrameBuilder constructor
parameter function. Else, an IllegalArgumentException("No schema specified") is thrown
unless it is for text provider (as providerName constructor parameter) where the default
schema with a single value column of type StringType is assumed.
getOffset Method
getOffset: Option[Offset]
Note: getOffset is part of the Source Contract to find the latest offset.
getOffset …FIXME
The maximum offset ( getOffset ) is calculated by fetching all the files in path excluding
files that start with _ (underscore).
When computing the maximum offset using getOffset , you should see the following
DEBUG message in the logs:
When computing the maximum offset using getOffset , it also filters out the files that were
already seen (tracked in seenFiles internal registry).
You should see the following DEBUG message in the logs (depending on the status of a
file):
You should see the following INFO and DEBUG messages in the logs:
The method to create a result batch is given at instantiation time (as dataFrameBuilder
constructor parameter).
metadataLog
metadataLog is a metadata storage using metadataPath path (which is a constructor
parameter).
fetchMaxOffset(): FileStreamSourceOffset
fetchMaxOffset …FIXME
fetchAllFiles …FIXME
allFilesUsingMetadataLogFileIndex Internal
Method
allFilesUsingMetadataLogFileIndex(): Seq[FileStatus]
requests it to allFiles .
FileStreamSink
import scala.concurrent.duration._
import org.apache.spark.sql.streaming.{OutputMode, Trigger}
val sq = in.
writeStream.
format("parquet").
option("path", "parquet-output-dir").
option("checkpointLocation", "checkpoint-dir").
trigger(Trigger.ProcessingTime(10.seconds)).
outputMode(OutputMode.Append).
start
Refer to Logging.
SparkSession
Root directory
FileFormat
Configuration options
addBatch(
batchId: Long,
data: DataFrame): Unit
Note: addBatch is part of the Sink Contract to "add" a batch of data to the sink.
addBatch …FIXME
Creating BasicWriteJobStatsTracker — basicWriteJobStatsTracker Internal Method

basicWriteJobStatsTracker: BasicWriteJobStatsTracker

basicWriteJobStatsTracker creates a BasicWriteJobStatsTracker with the following metrics:
hasMetadata(
path: Seq[String],
hadoopConf: Configuration): Boolean
hasMetadata …FIXME
Internal Properties

Name        Description
basePath    Base path (Hadoop's Path for the given path). Used when…FIXME
logPath     Metadata log path (Hadoop's Path for the base path and the _spark_metadata)
hadoopConf  Hadoop's Configuration. Used when…FIXME
FileStreamSinkLog
FileStreamSinkLog is a concrete CompactibleFileStreamLog (of SinkFileStatuses) for FileStreamSink.
FileStreamSinkLog uses delete action to mark metadata logs that should be excluded from
compaction.
created:
Metadata version
SparkSession
compactLogs Method
compactLogs …FIXME
SinkFileStatus
SinkFileStatus represents the status of files of FileStreamSink (and the type of the
metadata of FileStreamSinkLog):
Path
Size
isDir flag
Modification time
Block replication
Block size
toFileStatus Method
toFileStatus: FileStatus
ManifestFileCommitProtocol
ManifestFileCommitProtocol is…FIXME
commitJob Method
commitJob(
jobContext: JobContext,
taskCommits: Seq[TaskCommitMessage]): Unit
commitJob …FIXME
commitTask Method
commitTask(
taskContext: TaskAttemptContext): TaskCommitMessage
commitTask …FIXME
MetadataLogFileIndex
MetadataLogFileIndex is a PartitioningAwareFileIndex of metadata log files (generated by
FileStreamSink).
Refer to Logging.
SparkSession
Hadoop’s Path
While being created, MetadataLogFileIndex prints out the following INFO message to the
logs:
Internal Properties

Name  Description
Kafka Data Source
Kafka Data Source provides a streaming source and a streaming sink for micro-batch and
continuous stream processing.
You should define spark-sql-kafka-0-10 module as part of the build definition in your Spark
project, e.g. as a libraryDependency in build.sbt for sbt:
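A minimal sketch of the sbt dependency (the Spark version is a placeholder; use the version that matches your Spark installation):

// build.sbt
libraryDependencies += "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.4.0"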
For Spark environments like spark-submit (and "derivatives" like spark-shell ), you should
use --packages command-line option:
Streaming Source
With spark-sql-kafka-0-10 module you can use kafka data source format for loading data
(reading records) from one or more Kafka topics as a streaming Dataset.
Internally, the kafka data source format for reading is available through
KafkaSourceProvider that is a MicroBatchReadSupport and ContinuousReadSupport for
micro-batch and continuous stream processing, respectively.
Table 1. Kafka Data Source’s Fixed Schema (in the positional order)
Name Type
key BinaryType
value BinaryType
topic StringType
partition IntegerType
offset LongType
timestamp TimestampType
timestampType IntegerType
scala> records.printSchema
root
|-- key: binary (nullable = true)
|-- value: binary (nullable = true)
|-- topic: string (nullable = true)
|-- partition: integer (nullable = true)
|-- offset: long (nullable = true)
|-- timestamp: timestamp (nullable = true)
|-- timestampType: integer (nullable = true)
Internally, the fixed schema is defined as part of the DataSourceReader contract through
MicroBatchReader and ContinuousReader extension contracts for micro-batch and
continuous stream processing, respectively.
Streaming Sink
With spark-sql-kafka-0-10 module you can use kafka data source format for writing the
result of executing a streaming query (a streaming Dataset) to one or more Kafka topics.
val sq = records
.writeStream
.format("kafka")
.option("kafka.bootstrap.servers", ":9092")
.option("topic", "kafka2console-output")
.option("checkpointLocation", "checkpointLocation-kafka2console")
.start
Internally, the kafka data source format for writing is available through KafkaSourceProvider
that is a StreamWriteSupport.
import org.apache.spark.sql.streaming.Trigger
import scala.concurrent.duration._
val sq = spark
.readStream
.format("kafka")
.option("subscribepattern", "kafka2console.*")
.option("kafka.bootstrap.servers", ":9092")
.load
.withColumn("value", $"value" cast "string") // deserializing values
.writeStream
.format("console")
.option("truncate", false) // format-specific option
.option("checkpointLocation", "checkpointLocation-kafka2console") // generic query o
ption
.trigger(Trigger.ProcessingTime(30.seconds))
.queryName("kafka2console-microbatch")
.start
Kafka Data Source can assign a single task per Kafka partition (using
KafkaOffsetRangeCalculator in Micro-Batch Stream Processing).
Kafka Data Source can reuse a Kafka consumer (using KafkaMicroBatchReader in Micro-
Batch Stream Processing).
import org.apache.spark.sql.streaming.Trigger
import scala.concurrent.duration._
val sq = spark
.readStream
.format("kafka")
.option("subscribepattern", "kafka2console.*")
.option("kafka.bootstrap.servers", ":9092")
.load
.withColumn("value", $"value" cast "string") // convert bytes to string for display
purposes
.writeStream
.format("console")
.option("truncate", false) // format-specific option
.option("checkpointLocation", "checkpointLocation-kafka2console") // generic query o
ption
.queryName("kafka2console-continuous")
.trigger(Trigger.Continuous(10.seconds))
.start
Configuration Options
{"topicA":[0,1],"topicB":[0,1]}
assign
332
Kafka Data Source
Default: (empty)
Starting offsets
Default: latest
Possible values:
latest
earliest
startingOffsets {"topicA":{"part":offset,"p1":-1},"topicB":{"0":-2}}
option(
Tip "startingOffsets",
"""{"topic1":{"0":5,"4":-1},"topic2":{"0":-2}}"
)
333
Kafka Data Source
topic1,topic2,topic3
subscribe
topic\d
topic
Default: (empty)
KafkaSourceProvider
KafkaSourceProvider requires the following options (that you can set using option method
of DataStreamReader or DataStreamWriter):
kafka.bootstrap.servers
Tip Refer to Kafka Data Source’s Options for the supported configuration options.
Internally, KafkaSourceProvider sets the properties for Kafka Consumers on executors (that
are passed on to InternalKafkaConsumer when requested to create a Kafka consumer with a
single TopicPartition manually assigned).
GROUP_ID_CONFIG is set to [uniqueGroupId]-executor on executors (FIXME)
Refer to Logging.
createSource(
sqlContext: SQLContext,
metadataPath: String,
schema: Option[StructType],
providerName: String,
parameters: Map[String, String]): Source
createSource …FIXME
Note: Parameters are case-insensitive, i.e. OptioN and option are equal.

validateGeneralOptions makes sure that exactly one topic subscription strategy is used in the parameters:
subscribe
subscribepattern
assign
validateGeneralOptions makes sure that the value of subscription strategies meet the
requirements:
subscribe strategy has at least one topic (in a comma-separated list of topics)
validateGeneralOptions makes sure that group.id has not been specified and reports an
IllegalArgumentException otherwise.
Kafka option 'group.id' is not supported as user-specified consumer groups are not used to track offsets.
validateGeneralOptions makes sure that auto.offset.reset has not been specified and reports an IllegalArgumentException otherwise.

validateGeneralOptions makes sure that the following options have not been specified and reports an IllegalArgumentException otherwise:
kafka.key.deserializer
kafka.value.deserializer
kafka.enable.auto.commit
kafka.interceptor.classes
Internally, strategy finds the keys in the input caseInsensitiveParams that are one of the
following and creates a corresponding ConsumerStrategy.
{"topicA":[0,1],"topicB":[0,1]}
topic1,topic2,topic3
topic\d
sourceSchema(
sqlContext: SQLContext,
schema: Option[StructType],
providerName: String,
parameters: Map[String, String]): (String, StructType)
sourceSchema gives the short name (i.e. kafka ) and the fixed schema.
Internally, sourceSchema validates Kafka options and makes sure that the optional input
schema is indeed undefined.
Kafka source has a fixed schema and cannot be set with a custom one
createContinuousReader(
schema: Optional[StructType],
metadataPath: String,
options: DataSourceOptions): KafkaContinuousReader
createContinuousReader …FIXME
getKafkaOffsetRangeLimit(
params: Map[String, String],
offsetOptionKey: String,
defaultOffsets: KafkaOffsetRangeLimit): KafkaOffsetRangeLimit
getKafkaOffsetRangeLimit finds the given offsetOptionKey in the params and does the
following conversion:
createMicroBatchReader(
schema: Optional[StructType],
metadataPath: String,
options: DataSourceOptions): KafkaMicroBatchReader
createMicroBatchReader finds all the parameters (in the given DataSourceOptions ) that start
with kafka. prefix, removes the prefix, and creates the current Kafka parameters.
Properties for Kafka consumers on the driver (given the current Kafka parameters, i.e.
without kafka. prefix)
the KafkaOffsetReader
Properties for Kafka consumers on executors (given the current Kafka parameters, i.e. without kafka. prefix) and the unique group ID (spark-kafka-source-[randomUUID]-[metadataPath_hashCode]-driver)
createRelation(
sqlContext: SQLContext,
parameters: Map[String, String]): BaseRelation
createRelation …FIXME
validateBatchOptions …FIXME
kafkaParamsForDriver Method
kafkaParamsForDriver …FIXME
kafkaParamsForExecutors Method
kafkaParamsForExecutors(
specifiedKafkaParams: Map[String, String],
uniqueGroupId: String): Map[String, Object]
While setting the properties, kafkaParamsForExecutors prints out the following DEBUG
message to the logs:
KafkaSource
KafkaSource is a streaming source that generates DataFrames of records from one or more topics in Apache Kafka.

Note: Kafka topics are checked for new records every trigger and so there is some noticeable delay between when the records have arrived to Kafka topics and when a Spark application processes them.
KafkaSource uses the streaming metadata log directory to persist offsets. The directory is
the source ID under the sources directory in the checkpointRoot (of the StreamExecution).
Refer to Logging.
SQLContext
KafkaOffsetReader
Streaming metadata log directory, i.e. the directory for streaming metadata log
(where KafkaSource persists KafkaSourceOffset offsets in JSON format)
Flag used to create KafkaSourceRDDs every trigger and when checking to report a
IllegalStateException on data loss.
getBatch(
start: Option[Offset],
end: Offset): DataFrame
getBatch creates a streaming DataFrame with a query plan with a LogicalRDD logical operator.

getBatch requests KafkaSourceOffset for end partition offsets for the input end offset (known as untilPartitionOffsets).

getBatch requests KafkaSourceOffset for start partition offsets for the input start offset (if defined) or uses the initial partition offsets.

getBatch finds the new partitions (as the difference between the topic partitions in untilPartitionOffsets and the start partition offsets) and requests the KafkaOffsetReader to fetch their earliest offsets.

getBatch reports a data loss if the new partitions don't match what KafkaOffsetReader fetched.
Cannot find earliest offsets of [partitions]. Some data may have been missed
getBatch reports a data loss if the new partitions don’t have their offsets 0 .
Added partition [partition] starts from [offset] instead of 0. Some data may have been
missed
untilPartitionOffsets partitions.
TopicPartitions: [topicPartitions]
getBatch gets the executors (sorted by executorId and host of the registered block
managers).
getBatch filters out KafkaSourceRDDOffsetRanges for which until offsets are smaller than from offsets (and reports a data loss).
In the end, getBatch creates a streaming DataFrame for the KafkaSourceRDD and the
schema.
getOffset: Option[Offset]
Internally, getOffset fetches the initial partition offsets (from the metadata log or Kafka
directly).
scala> spark.version
res0: String = 2.3.0-SNAPSHOT
// And start over to see what offset the query starts from
// Checkpoint location should have the offsets
val q = records.
writeStream.
format("console").
option("truncate", false).
option("checkpointLocation", "/tmp/checkpoint"). // <-- checkpoint directory
trigger(Trigger.ProcessingTime(10.seconds)).
outputMode(OutputMode.Update).
start
// Whoops...console format does not support recovery (!)
// Reported as https://issues.apache.org/jira/browse/SPARK-21667
org.apache.spark.sql.AnalysisException: This query does not support recovering from ch
fetchAndVerify makes sure that the starting offsets in specificOffsets are the same as the offsets fetched from Kafka (and reports a data loss otherwise).
initialPartitionOffsets is the initial partition offsets for batch 0 that were either already persisted in the streaming metadata log directory or persisted on demand (as KafkaSourceOffset).

HDFSMetadataLog.serialize
serialize(
metadata: KafkaSourceOffset,
out: OutputStream): Unit
serialize requests the OutputStream to write a zero byte (to support Spark 2.1.0 as per
SPARK-19517).
serialize requests the BufferedWriter to write the v1 version indicator followed by a new
line.
In the end, serialize requests the BufferedWriter to flush (the underlying stream).
rateLimit(
limit: Long,
from: Map[TopicPartition, Long],
until: Map[TopicPartition, Long]): Map[TopicPartition, Long]
Caution FIXME
getSortedExecutorList Method
Caution FIXME
Caution FIXME
Internal Properties

Name                     Description
currentPartitionOffsets  Current partition offsets (as Map[TopicPartition, Long]). Initially NONE and set when KafkaSource is requested to get the maximum available offsets or generate a DataFrame with records from Kafka for a batch.
pollTimeoutMs
KafkaRelation
KafkaRelation represents a collection of rows with a predefined schema (i.e. it is a BaseRelation).
Refer to Logging.
SQLContext
ConsumerStrategy
failOnDataLoss flag
Starting offsets
Ending offsets
getPartitionOffsets(
kafkaReader: KafkaOffsetReader,
kafkaOffsets: KafkaOffsetRangeLimit): Map[TopicPartition, Long]
Caution FIXME
buildScan(): RDD[Row]

Note: buildScan is part of the TableScan contract to build a distributed data scan with column pruning.

buildScan generates a unique group ID (to make sure that a streaming query creates a new consumer group).

buildScan uses the KafkaOffsetReader to getPartitionOffsets for the starting and ending offsets, and creates offset ranges (of Kafka TopicPartition, beginning and ending offsets and undefined preferred location).
Kafka parameters for executors based on the given specifiedKafkaParams and the
unique group ID ( spark-kafka-relation-[randomUUID] )
pollTimeoutMs configuration
In the end, buildScan requests the SQLContext to create a DataFrame (with the name
kafka and the predefined schema) that is immediately converted to a RDD[InternalRow] .
different topic partitions for starting offsets topics[[fromTopics]] and ending offsets topics[[untilTopics]]
KafkaSourceRDD
KafkaSourceRDD is an RDD of Kafka's ConsumerRecords (RDD[ConsumerRecord[Array[Byte], Array[Byte]]]).
SparkContext
Collection of key-value settings for executors reading records from Kafka topics
Used when KafkaSourceRDD is requested for records (for given offsets) and in turn
requests CachedKafkaConsumer to poll for Kafka’s ConsumerRecords .
Flag to…FIXME
Flag to…FIXME
getPreferredLocations(
split: Partition): Seq[String]
FIXME
compute(
thePart: Partition,
context: TaskContext
): Iterator[ConsumerRecord[Array[Byte], Array[Byte]]]
partition).
compute resolves the range (based on the offsetRange of the given partition that is
assumed a KafkaSourceRDDPartition ).
record.
When the beginning and ending offsets (of the offset range) are equal, compute prints out
the following INFO message to the logs, requests the KafkaDataConsumer to release and
returns an empty iterator.
Beginning offset [fromOffset] is the same as ending offset skipping [topic] [partition]
compute throws an AssertionError when the beginning offset ( fromOffset ) is after the
getPartitions Method
getPartitions: Array[Partition]
getPartitions …FIXME
persist: Array[Partition]
persist …FIXME
resolveRange(
consumer: KafkaDataConsumer,
range: KafkaSourceRDDOffsetRange
): KafkaSourceRDDOffsetRange
resolveRange …FIXME
CachedKafkaConsumer
Caution FIXME
Caution FIXME
Caution FIXME
KafkaSourceOffset
KafkaSourceOffset is a custom Offset for kafka data source.
mergeOffsets
and getOrCreateInitialPartitionOffsets
KafkaSource is requested for the initial partition offsets (of 0th batch) and getOffset
an InputStream)
getPartitionOffsets(
offset: Offset): Map[TopicPartition, Long]
KafkaSourceOffset or SerializedOffset .
json: String
json …FIXME
apply(
offsetTuples: (String, Int, Long)*): KafkaSourceOffset (1)
apply(
offset: SerializedOffset): KafkaSourceOffset
apply …FIXME
KafkaOffsetReader
KafkaOffsetReader relies on the ConsumerStrategy to create a Kafka Consumer.
and createContinuousReader
Refer to Logging.
ConsumerStrategy
Kafka parameters (as name-value pairs that are used exclusively to create a Kafka consumer)
nextGroupId(): String
nextGroupId sets the groupId to be the driverGroupIdPrefix, - followed by the nextId (i.e.
[driverGroupIdPrefix]-[nextId] ).
In the end, nextGroupId increments the nextId and returns the groupId.
resetConsumer(): Unit
resetConsumer …FIXME
fetchTopicPartitions Method
fetchTopicPartitions(): Set[TopicPartition]
Caution FIXME
Caution FIXME
Caution FIXME
withRetriesWithoutInterrupt(
body: => Map[TopicPartition, Long]): Map[TopicPartition, Long]
withRetriesWithoutInterrupt …FIXME
fetchSpecificOffsets(
partitionOffsets: Map[TopicPartition, Long],
reportDataLoss: String => Unit): KafkaSourceOffset
Consumer.assignment() ).
For every partition offset in the input partitionOffsets , fetchSpecificOffsets requests the
Kafka Consumer to:
Note: Since the consumer method is used (to access the internal Kafka Consumer) in the fetch methods, that gives the property of creating a new Kafka Consumer whenever the internal Kafka Consumer reference becomes null, i.e. as in resetConsumer.
consumer …FIXME
close(): Unit
close stops the Kafka Consumer (if the Kafka Consumer is available).
runUninterruptibly …FIXME
stopConsumer(): Unit
stopConsumer …FIXME
toString: String
toString …FIXME
Internal Properties

Name                          Description
consumer                      Kafka's Consumer (Consumer[Array[Byte], Array[Byte]])
execContext                   scala.concurrent.ExecutionContextExecutorService
groupId
kafkaReaderThread             java.util.concurrent.ExecutorService
maxOffsetFetchAttempts
nextId                        Initially 0
offsetFetchAttemptIntervalMs
ConsumerStrategy
AssignStrategy
Uses KafkaConsumer.assign(Collection<TopicPartition>
partitions)
SubscribeStrategy
Uses KafkaConsumer.subscribe(Collection<String>
topics)
KafkaSink
KafkaSink is a streaming sink that KafkaSourceProvider registers as the kafka format.
// in another terminal
$ echo hello > server-logs/hello.out
SQLContext
addBatch Method
Internally, addBatch requests KafkaWriter to write the input data to the topic (if defined)
or a topic in executorKafkaParams.
Note addBatch is a part of Sink Contract to "add" a batch of data to the sink.
KafkaOffsetRangeLimit — Desired Offset Range Limits
KafkaOffsetRangeLimit represents the desired offset range limits for starting, ending, and
specific offsets.
Table 1. KafkaOffsetRangeLimits
KafkaOffsetRangeLimit Description
EarliestOffsetRangeLimit Intent to bind to the earliest offset
LatestOffsetRangeLimit Intent to bind to the latest offset
SpecificOffsetRangeLimit Intent to bind to specific offsets, with two special offset values:
-1 or KafkaOffsetRangeLimit.LATEST - the latest offset
-2 or KafkaOffsetRangeLimit.EARLIEST - the earliest offset
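The special offset values can be used in the startingOffsets option of the kafka data
source. A usage sketch (the topic and partitions below are made up):
val fromEarliest = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", ":9092")
  .option("subscribe", "demo")
  // -2 = KafkaOffsetRangeLimit.EARLIEST, -1 = KafkaOffsetRangeLimit.LATEST
  .option("startingOffsets", """{"demo":{"0":-2,"1":-2}}""")
  .load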
KafkaDataConsumer
KafkaDataConsumer is the abstraction of Kafka consumers that use an InternalKafkaConsumer.
internalConsumer: InternalKafkaConsumer
internalConsumer
Used when…FIXME
release(): Unit
release
Used when…FIXME
Table 2. KafkaDataConsumers
KafkaDataConsumer Description
CachedKafkaDataConsumer
NonCachedKafkaDataConsumer
acquire(
topicPartition: TopicPartition,
kafkaParams: ju.Map[String, Object],
useCache: Boolean
): KafkaDataConsumer
acquire …FIXME
get(
offset: Long,
untilOffset: Long,
pollTimeoutMs: Long,
failOnDataLoss: Boolean
): ConsumerRecord[Array[Byte], Array[Byte]]
get …FIXME
KafkaMicroBatchReader
KafkaMicroBatchReader is the MicroBatchReader for kafka data source for Micro-Batch
Stream Processing.
KafkaMicroBatchReader is created exclusively when KafkaSourceProvider is requested to
create a MicroBatchReader.
Refer to Logging.
KafkaOffsetReader
DataSourceOptions
Metadata Path
failOnDataLoss option
readSchema Method
readSchema(): StructType
stop(): Unit
planInputPartitions(): java.util.List[InputPartition[InternalRow]]
planInputPartitions first finds the new partitions ( TopicPartitions that are in the
endPartitionOffsets but not in the startPartitionOffsets).
planInputPartitions then prints out the following DEBUG message to the logs:
getSortedExecutorList(): Array[String]
getSortedExecutorList gets the peers (the other nodes in a Spark cluster), creates an
ExecutorCacheTaskLocation for every pair of host and executor ID, and, in the end, sorts
them in descending order.
getOrCreateInitialPartitionOffsets Internal
Method
getOrCreateInitialPartitionOffsets(): PartitionOffsetMap
getOrCreateInitialPartitionOffsets …FIXME
getStartOffset Method
getStartOffset: Offset
getStartOffset …FIXME
getEndOffset Method
getEndOffset: Offset
Note getEndOffset is part of the MicroBatchReader Contract to get the end offsets.
getEndOffset …FIXME
deserializeOffset Method
deserializeOffset …FIXME
Internal Properties
Name Description
endPartitionOffsets Ending offsets for the assigned partitions ( Map[TopicPartition, Long] )
Used when…FIXME
KafkaOffsetRangeCalculator
KafkaOffsetRangeCalculator is created for KafkaMicroBatchReader to calculate offset
ranges ( KafkaOffsetRanges ).
getRanges(
fromOffsets: PartitionOffsetMap,
untilOffsets: PartitionOffsetMap,
executorLocations: Seq[String] = Seq.empty): Seq[KafkaOffsetRange]
getRanges finds the common TopicPartitions that are the keys that are used in the
For every common TopicPartition , getRanges creates a KafkaOffsetRange with the from
and until offsets from the fromOffsets and untilOffsets collections (and the preferredLoc
undefined). getRanges filters out the TopicPartitions that have no records to consume (i.e.
the difference between until and from offsets is not greater than 0 ).
getRanges branches off based on the defined minimum number of partitions per executor
For the minimum number of partitions per executor undefined or smaller than the number of
KafkaOffsetRanges ( TopicPartitions to consume records from), getRanges updates every
KafkaOffsetRange with the preferred executor based on the TopicPartition and the
executorLocations ).
Otherwise (with the minimum number of partitions per executor defined and greater than the
number of KafkaOffsetRanges ), getRanges splits KafkaOffsetRanges into smaller ones.
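The minimum number of partitions comes from the minPartitions option of the kafka data
source. A usage sketch (broker address and topic are made up):
spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", ":9092")
  .option("subscribe", "demo")
  .option("minPartitions", 10) // <-- makes KafkaOffsetRangeCalculator split offset ranges
  .load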
TopicPartition
fromOffset offset
untilOffset offset
KafkaOffsetRange knows the size, i.e. the number of records between the untilOffset and
fromOffset offsets.
getLocation(
tp: TopicPartition,
executorLocations: Seq[String]): Option[String]
getLocation …FIXME
KafkaMicroBatchInputPartition
KafkaMicroBatchInputPartition is an InputPartition (of InternalRows ) that is created
when KafkaMicroBatchReader is requested to plan input partitions.
KafkaOffsetRange
failOnDataLoss flag
reuseKafkaConsumer flag
contract).
KafkaMicroBatchInputPartitionReader
KafkaMicroBatchInputPartitionReader is an InputPartitionReader (of InternalRows ) that is
created for a KafkaMicroBatchInputPartition.
KafkaOffsetRange
failOnDataLoss flag
reuseKafkaConsumer flag
next Method
next(): Boolean
next checks whether the KafkaDataConsumer should poll records or not (i.e. whether the
nextOffset is smaller than the untilOffset of the KafkaOffsetRange).
If so, next requests the KafkaDataConsumer to get (poll) records in the range of nextOffset
and the untilOffset (of the KafkaOffsetRange) with the given pollTimeoutMs and
failOnDataLoss.
If the nextOffset is equal or larger than the untilOffset (of the KafkaOffsetRange), next
simply returns false .
close(): Unit
resolveRange(
range: KafkaOffsetRange): KafkaOffsetRange
resolveRange …FIXME
Internal Properties
Name Description
converter KafkaRecordToUnsafeRowConverter
rangeToRead KafkaOffsetRange
KafkaSourceInitialOffsetWriter
KafkaSourceInitialOffsetWriter is a Hadoop DFS-based metadata storage for
KafkaSourceOffsets.
KafkaSourceInitialOffsetWriter is created when KafkaMicroBatchReader is requested to
getOrCreateInitialPartitionOffsets.
KafkaSourceInitialOffsetWriter takes the following to be created:
SparkSession
Path of the metadata directory
deserialize(
in: InputStream): KafkaSourceOffset
deserialize …FIXME
KafkaContinuousReader — ContinuousReader for Kafka Data Source in Continuous Stream
Processing
KafkaContinuousReader is a ContinuousReader for Kafka Data Source in Continuous Stream
Processing.
KafkaContinuousReader is created exclusively when KafkaSourceProvider is requested to
create a ContinuousReader.
Refer to Logging.
KafkaOffsetReader
Metadata path
Initial offsets
failOnDataLoss flag
planInputPartitions(): java.util.List[InputPartition[InternalRow]]
planInputPartitions …FIXME
setStartOffset Method
setStartOffset(
start: Optional[Offset]): Unit
setStartOffset …FIXME
deserializeOffset Method
deserializeOffset(
json: String): Offset
deserializeOffset …FIXME
mergeOffsets Method
mergeOffsets(
offsets: Array[PartitionOffset]): Offset
mergeOffsets …FIXME
KafkaContinuousInputPartition
KafkaContinuousInputPartition is…FIXME
TextSocketSourceProvider
TextSocketSourceProvider is a StreamSourceProvider for TextSocketSource that reads
lines from a socket (at a given host and port).
It requires two mandatory options (that you can set using option method):
1. host
2. port
includeTimestamp Option
Caution FIXME
createSource
createSource grabs the two mandatory options — host and port — and returns an
TextSocketSource.
sourceSchema
sourceSchema returns textSocket as the name of the source and the schema that can be
one of the two available schemas:
1. SCHEMA_REGULAR (default) which is a schema with a single value field of String type.
2. SCHEMA_TIMESTAMP when includeTimestamp option is enabled ( false by default). The
schema has a value field of StringType type and a timestamp field of TimestampType
type of format yyyy-MM-dd HH:mm:ss .
Internally, it starts by printing out the following WARN message to the logs:
WARN TextSocketSourceProvider: The socket source should not be used for production
applications! It does not support recovery and stores state indefinitely.
It then checks whether host and port parameters are defined and, if not, throws an
AnalysisException .
TextSocketSource
TextSocketSource is a streaming source that reads lines from a socket at a given host and
port.
It uses lines internal in-memory buffer to keep all of the lines that were read from a socket
forever.
Caution
TextSocketSource is not for production use due to design constraints, e.g. infinite
in-memory collection of lines read and no fault recovery. It is designed only for
tutorials and debugging.
import org.apache.spark.sql.SparkSession
val spark: SparkSession = SparkSession.builder.getOrCreate()
// Connect to localhost:9999
// You can use "nc -lk 9999" for demos
val textSocket = spark.
readStream.
format("socket").
option("host", "localhost").
option("port", 9999).
load
import org.apache.spark.sql.Dataset
val lines: Dataset[String] = textSocket.as[String].map(_.toUpperCase)

// Start the streaming query to the console sink (lines typed into nc get uppercased)
val query = lines.writeStream
  .format("console")
  .start
-------------------------------------------
Batch: 0
-------------------------------------------
+---------+
| value|
+---------+
|UPPERCASE|
+---------+
scala> query.explain
== Physical Plan ==
*SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, Str
ingType, fromString, input[0, java.lang.String, true], true) AS value#21]
+- *MapElements <function1>, obj#20: java.lang.String
+- *DeserializeToObject value#43.toString, obj#19: java.lang.String
+- LocalTableScan [value#43]
scala> query.stop
lines is the internal buffer of all the lines TextSocketSource read from the socket.
TextSocketSource 's offset can either be none or LongOffset of the number of lines in the
internal lines buffer.
With the includeTimestamp flag enabled, the schema has a value field of StringType type
and a timestamp field of TimestampType type of format yyyy-MM-dd HH:mm:ss .
TextSocketSource(
host: String,
port: Int,
includeTimestamp: Boolean,
sqlContext: SQLContext)
1. host
2. port
3. includeTimestamp flag
4. SQLContext
Caution
It appears that the source did not get "renewed" to use SparkSession instead.
It opens a socket at given host and port parameters and reads a buffering character-
input stream using the default charset and the default-sized input buffer (of 8192 bytes) line
by line.
RateSourceProvider
RateSourceProvider is a StreamSourceProvider for RateStreamSource (that acts as the
source of the rate data source format).
RateStreamSource
RateStreamSource is a streaming source that generates consecutive numbers with
rampUpTime Default: 0 (seconds)
rowsPerSecond Default: 1
Number of rows to generate per second (has to be greater than 0 )
value LongType
lastTimeMs
maxSeconds
startTimeMs
Refer to Logging.
getOffset: Option[Offset]
Caution FIXME
Internally, getBatch calculates the seconds to start from and end at (from the input start
and end offsets) or assumes 0 .
getBatch then calculates the values to generate for the start and end seconds.
If the start and end ranges are equal, getBatch creates an empty DataFrame (with the
schema) and returns.
Otherwise, when the ranges are different, getBatch creates a DataFrame using
SparkContext.range operator (for the start and end ranges and numPartitions partitions).
SQLContext
Number of partitions
RateStreamMicroBatchReader
RateStreamMicroBatchReader is…FIXME
ConsoleSinkProvider
ConsoleSinkProvider is a DataSourceV2 with StreamWriteSupport for console data source
format.
source format.
import org.apache.spark.sql.streaming.Trigger
val q = spark
.readStream
.format("rate")
.load
.writeStream
.format("console") // <-- requests ConsoleSinkProvider for a sink
.trigger(Trigger.Once)
.start
scala> println(q.lastProgress.sink)
{
"description" : "org.apache.spark.sql.execution.streaming.ConsoleSinkProvider@2392cf
b1"
}
ConsoleSinkProvider is a CreatableRelationProvider.
createRelation Method
createRelation(
sqlContext: SQLContext,
mode: SaveMode,
parameters: Map[String, String],
data: DataFrame): BaseRelation
createRelation …FIXME
ConsoleWriter
ConsoleWriter is a StreamWriter for console data source format.
ForeachWriterProvider
ForeachWriterProvider is…FIXME
ForeachWriter
ForeachWriter is the contract for a foreach writer that is a streaming format that controls
streaming writes.
ForeachWriter Contract
package org.apache.spark.sql
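The contract boils down to the three methods below (a sketch that follows the methods
used in the ForeachSink example in the next section; exact parameter names may differ
across Spark versions):
abstract class ForeachWriter[T] extends Serializable {
  // Called when starting to process a partition (return false to skip it)
  def open(partitionId: Long, version: Long): Boolean
  // Called for every record in the partition
  def process(value: T): Unit
  // Called once all rows have been processed (or an error occurred)
  def close(errorOrNull: Throwable): Unit
}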
ForeachSink
ForeachSink is a typed streaming sink that passes rows (of the type T ) to ForeachWriter
(one record at a time per partition).
import org.apache.spark.sql.ForeachWriter
val writer = new ForeachWriter[String] {
override def open(partitionId: Long, version: Long) = true
override def process(value: String) = println(value)
override def close(errorOrNull: Throwable) = {}
}
records.writeStream
.queryName("server-logs processor")
.foreach(writer)
.start
Internally, addBatch (the only method from the Sink Contract) takes records from the input
DataFrame (as data ), transforms them to expected type T (of this ForeachSink ) and
(now as a Dataset) processes each partition.
addBatch then opens the constructor’s ForeachWriter (for the current partition and the input
batch) and passes the records to process (one at a time per partition).
Caution
FIXME Why does Spark track whether the writer failed or not? Why couldn’t it finally
and do close ?
ForeachBatchSink
ForeachBatchSink is a streaming sink that is used for the DataStreamWriter.foreachBatch
streaming operator.
import org.apache.spark.sql.Dataset
val q = spark.readStream
.format("rate")
.load
.writeStream
  .foreachBatch { (output: Dataset[_], batchId: Long) => // <-- creates a ForeachBatchSink
println(s"Batch ID: $batchId")
output.show
}
.start
// q.stop
scala> println(q.lastProgress.sink.description)
ForeachBatchSink
Encoder ( ExpressionEncoder[T] )
Note addBatch is a part of Sink Contract to "add" a batch of data to the sink.
addBatch …FIXME
Memory Data Source
MemoryStreamBase
MemorySinkBase
Memory data source supports Micro-Batch and Continuous stream processing modes.
Caution
Memory Data Source is not for production use due to design constraints, e.g. infinite
in-memory collection of lines read and no fault recovery. MemoryStream is designed
primarily for unit tests, tutorials and debugging.
Memory Sink
Memory sink requires that a streaming query has a name (defined using
DataStreamWriter.queryName or queryName option).
Memory sink may optionally define checkpoint location using checkpointLocation option
that is used to recover from for Complete output mode only.
Examples
Memory Source in Micro-Batch Stream Processing
import org.apache.spark.sql.execution.streaming.MemoryStream
// It uses two implicits: Encoder[Int] and SQLContext
import spark.implicits._
implicit val sqlContext = spark.sqlContext
val intsIn = MemoryStream[Int]
// A minimal completion of this example: write the memory source out to a memory sink
val memoryQuery = intsIn.toDF.writeStream
  .format("memory")
  .queryName("memStream")
  .start
intsIn.addData(0, 1, 2)
memoryQuery.processAllAvailable()
val intsOut = spark.table("memStream")
scala> intsOut.show
+-----+
|value|
+-----+
|    0|
|    1|
|    2|
+-----+
memoryQuery.stop()
import org.apache.spark.sql.streaming.StreamingQuery
assert(sq.isInstanceOf[StreamingQuery])
import org.apache.spark.sql.execution.streaming.StreamingQueryWrapper
val se = sq.asInstanceOf[StreamingQueryWrapper].streamingQuery
import org.apache.spark.sql.execution.streaming.MemorySink
val sink = se.sink.asInstanceOf[MemorySink]
assert(sink.toString == "MemorySink")
sink.clear()
MemoryStream
Refer to Logging.
ID
SQLContext
apply[A : Encoder](
implicit sqlContext: SQLContext): MemoryStream[A]
addData(
data: TraversableOnce[A]): Offset
Internally, addData prints out the following DEBUG message to the logs:
Adding: [data]
In the end, addData increments the current offset and adds the data to the batches internal
registry.
When executed, getBatch uses the internal batches collection to return requested offsets.
logicalPlan: LogicalPlan
attributes).
scala> ints.toDS.queryExecution.logical.isStreaming
res14: Boolean = true
scala> ints.toDS.queryExecution.logical
res15: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan = MemoryStream[value#13]
Dataset ).
toString: String
toString uses the output schema to return the following textual representation:
MemoryStream[[output]]
planInputPartitions(): java.util.List[InputPartition[InternalRow]]
planInputPartitions …FIXME
generateDebugString(
rows: Seq[UnsafeRow],
startOrdinal: Int,
endOrdinal: Int): String
Internal Properties
Name Description
batches Batch data ( ListBuffer[Array[UnsafeRow]] )
ContinuousMemoryStream
ContinuousMemoryStream is…FIXME
MemorySink
MemorySink is a streaming sink that stores batches (records) in memory.
MemorySink is used for memory format and requires a query name (by queryName method
or queryName option).
Its aim is to allow users to test streaming applications in the Spark shell or other local tests.
It creates MemorySink instance based on the schema of the DataFrame it operates on.
It creates a new DataFrame using MemoryPlan with MemorySink instance created earlier
and registers it as a temporary table (using DataFrame.registerTempTable method).
At this point you can query the table as if it were a regular non-streaming table
Note
using sql method.
Refer to Logging.
Output schema
OutputMode
batches: ArrayBuffer[AddedData]
batches holds data from streaming batches that have been added (written) to this sink.
For Append and Update output modes, batches holds rows from all batches.
For Complete output mode, batches holds rows from the last batch only.
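A usage sketch of the Complete output mode behaviour (the query name is made up): every
micro-batch replaces the content of the batches registry with the complete result.
val counts = spark.readStream
  .format("rate")
  .load
  .groupBy("value")
  .count
val countsQuery = counts.writeStream
  .format("memory")
  .queryName("counts") // <-- queried as a temporary table
  .outputMode("complete")
  .start
// spark.table("counts").show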
addBatch(
batchId: Long,
data: DataFrame): Unit
Note addBatch is part of the Sink Contract to "add" a batch of data to the sink.
addBatch branches off based on whether the given batchId has already been committed
or not.
A batch ID is considered committed when it is not greater than the latest batch ID added
(if available).
For Append and Update output modes, addBatch adds the data (as a AddedData ) to the
batches internal registry.
For Complete output mode, addBatch clears the batches internal registry first before adding
the data (as a AddedData ).
Batch Committed
With the batchId committed, addBatch simply prints out the following DEBUG message to
the logs and returns.
clear(): Unit
clear simply removes (clears) all data from the batches internal registry.
MemorySinkV2
MemoryStreamWriter
MemoryStreamWriter is…FIXME
MemoryStreamBase
addData(
  data: TraversableOnce[A]): Offset
Table 2. MemoryStreamBases
MemoryStreamBase Description
ContinuousMemoryStream
SQLContext
toDS(): Dataset[A]
toDS simply creates a Dataset (for the sqlContext and the logicalPlan)
toDF(): DataFrame
toDF simply creates a Dataset of rows (for the sqlContext and the logicalPlan)
Internal Properties
Name Description
attributes Schema attributes of the encoder ( Seq[AttributeReference] )
Used when…FIXME
MemorySinkBase
dataSinceBatch(
  sinceBatchId: Long): Seq[Row]
Table 2. MemorySinkBases
MemorySinkBase Description
Offsets and Metadata Checkpointing
Stream execution engines use checkpoint location to resume stream processing and get
start offsets to start query processing from.
StreamExecution resumes (populates the start offsets) from the latest checkpointed offsets
from the Write-Ahead Log (WAL) of Offsets that may have already been processed (and, if
so, committed to the Offset Commit Log).
"zero" micro-batch).
The available offsets are then added to the committed offsets when the latest batch ID
available (as described above) is exactly the latest batch ID committed to the Offset Commit
Log when MicroBatchExecution stream processing engine is requested to populate start
offsets from checkpoint.
When a streaming query is started from scratch (with no checkpoint that has offsets in the
Offset Write-Ahead Log), MicroBatchExecution prints out the following INFO message:
When a streaming query is resumed (restarted) from a checkpoint with offsets in the Offset
Write-Ahead Log, MicroBatchExecution prints out the following INFO message:
Every time MicroBatchExecution is requested to check whether a new data is available (in
any of the streaming sources)…FIXME
While constructing the next streaming micro-batch, every streaming source is requested
for the latest offset available, which is added to the availableOffsets registry.
Streaming sources report some offsets or none at all (if a source has never received
any data). Streaming sources with no data are excluded (filtered out).
noDataBatchesEnabled = [noDataBatchesEnabled],
lastExecutionRequiresAnotherBatch =
[lastExecutionRequiresAnotherBatch], isNewDataAvailable =
[isNewDataAvailable], shouldConstructNextBatch =
[shouldConstructNextBatch]
In the end (of running a single streaming micro-batch), MicroBatchExecution commits (adds)
the available offsets (to the committedOffsets registry) so they are considered processed
already.
Limitations (Assumptions)
It is assumed that the order of streaming sources in a streaming query matches the order of
the offsets of OffsetSeq (in offsetLog) and availableOffsets.
In other words, a streaming query can be modified and then restarted from a checkpoint (to
maintain stream processing state) only when the number of streaming sources and their
order are preserved across restarts.
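A usage sketch (the checkpoint path is made up): a streaming query defines its checkpoint
location with the checkpointLocation option so the offsets and commits metadata logs
survive restarts.
val restartable = spark.readStream
  .format("rate")
  .load
  .writeStream
  .format("console")
  .option("checkpointLocation", "/tmp/rate2console-checkpoint")
  .start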
MetadataLog
add(
batchId: Long,
metadata: T): Boolean
get(
batchId: Long): Option[T]
get(
startId: Option[Long],
endId: Option[Long]): Array[(Long, T)]
get
Retrieves (gets) metadata of one or more batches
Used when…FIXME
Used when…FIXME
Used when…FIXME
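Put together, the contract can be sketched as follows (an approximation; getLatest and
purge are covered with HDFSMetadataLog later):
trait MetadataLog[T] {
  // Stores metadata for the given batch (false if already stored)
  def add(batchId: Long, metadata: T): Boolean
  // Retrieves metadata of a single batch
  def get(batchId: Long): Option[T]
  // Retrieves metadata of a range of batches
  def get(startId: Option[Long], endId: Option[Long]): Array[(Long, T)]
  // The latest batch with its metadata, if any
  def getLatest(): Option[(Long, T)]
  // Removes metadata of batches below the given threshold
  def purge(thresholdBatchId: Long): Unit
}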
HDFSMetadataLog — Hadoop DFS-based Metadata Storage
HDFSMetadataLog is a concrete metadata storage (of type T ) that uses Hadoop DFS for
fault-tolerance and reliability.
HDFSMetadataLog uses the given path as the metadata directory with metadata logs.
HDFSMetadataLog uses Json4s with the Jackson binding for metadata serialization and
deserialization (to and from JSON format).
SparkSession
While being created, HDFSMetadataLog creates the path unless it exists already.
serialize(
metadata: T,
out: OutputStream): Unit
serialize simply writes the log data (serialized using Json4s (with Jackson binding)
library).
deserialize(in: InputStream): T
get …FIXME
get(
startId: Option[Long],
endId: Option[Long]): Array[(Long, T)]
Note get is part of the MetadataLog Contract to get metadata of range of batches.
get …FIXME
add(
batchId: Long,
metadata: T): Boolean
add returns true when the metadata of the streaming batch was not available and was
stored successfully.
Internally, add looks up metadata of the given streaming batch ( batchId ) and returns
false when found.
Otherwise, when not found, add creates a metadata log file for the given batchId and
writes metadata to the file. add returns true if successful.
getLatest requests the internal FileManager for the files in metadata directory that match
getLatest takes the batch ids (the batch files correspond to) and sorts the ids in reverse
order.
getLatest gives the first batch id with the metadata which could be found in the metadata
storage.
Note
It is possible that the batch id could be in the metadata storage, but not available
for retrieval.
purge …FIXME
batchIdToPath simply creates a Hadoop Path for the file named after the specified batchId
(under the metadata directory).
isBatchFile Method
isBatchFile …FIXME
pathToBatchId Method
pathToBatchId …FIXME
verifyBatchIds(
batchIds: Seq[Long],
startId: Option[Long],
endId: Option[Long]): Unit
verifyBatchIds …FIXME
parseVersion(
text: String,
maxSupportedVersion: Int): Int
parseVersion …FIXME
purgeAfter Method
purgeAfter …FIXME
writeBatchToFile(
metadata: T,
path: Path): Unit
getOrderedBatchFiles(): Array[FileStatus]
getOrderedBatchFiles …FIXME
Internal Properties
Name Description
fileManager CheckpointFileManager
Used when…FIXME
CommitLog
CommitLog writes commit metadata to files with names that are offsets.
$ ls -tr [checkpoint-directory]/commits
0 1 2 3 4 5 6 7 8 9
$ cat [checkpoint-directory]/commits/8
v1
{"nextBatchWatermarkMs": 0}
SparkSession
serialize(
metadata: CommitMetadata,
out: OutputStream): Unit
serialize writes out the version prefixed with v on a single line (e.g. v1 ) followed by the
commit metadata in JSON format.
deserialize simply reads (deserializes) two lines from the given InputStream : the version
and the commit metadata (in JSON format).
add Method
add …FIXME
add Method
add …FIXME
CommitMetadata
CommitMetadata is…FIXME
OffsetSeqLog
OffsetSeqLog uses OffsetSeq for metadata, which holds an ordered collection of offsets
and optional metadata.
OffsetSeqLog is created exclusively for the write-ahead log (WAL) of offsets of stream
execution engines.
OffsetSeqLog uses 1 for the version when serializing and deserializing metadata.
SparkSession
serialize(
offsetSeq: OffsetSeq,
out: OutputStream): Unit
serialize firstly writes out the version prefixed with v on a single line (e.g. v1 ) followed
serialize then writes out the offsets in JSON format, one per line.
$ ls -tr [checkpoint-directory]/offsets
0 1 2 3 4 5 6
$ cat [checkpoint-directory]/offsets/6
v1
{"batchWatermarkMs":0,"batchTimestampMs":1502872590006,"conf":{"spark.sql.shuffle.part
itions":"200","spark.sql.streaming.stateStore.providerClass":"org.apache.spark.sql.exe
cution.streaming.state.HDFSBackedStateStoreProvider"}}
51
deserialize reads the optional metadata (with an empty line for metadata not available).
In the end, deserialize creates a OffsetSeq for the optional metadata and the
SerializedOffsets .
OffsetSeq
OffsetSeq is the metadata managed by Hadoop DFS-based metadata storage.
storage)
streaming micro-batch to commit available offsets for a batch to the write-ahead log)
addOffset
Collection of optional Offsets (with None for streaming sources with no new data
available)
toStreamProgress(
sources: Seq[BaseStreamingSource]): StreamProgress
toStreamProgress creates a new StreamProgress and adds the streaming sources for which
there are offsets available.
Note
Offsets is a collection with holes (empty elements) for streaming sources with no new
data available.
toStreamProgress throws an AssertionError if the number of the input sources does not
match the number of the offsets:
There are [[offsets.size]] sources in the checkpoint offsets and now there are [[sourc
es.size]] sources requested by the query. Cannot continue.
Note commits checkpoints and construct (or skip) the next streaming micro-
batch
ContinuousExecution is requested for start offsets
toString: String
toString simply converts the Offsets to JSON (if an offset is available) or a dash ( - )
if an offset is not available.
fill(
offsets: Offset*): OffsetSeq (1)
fill(
metadata: Option[String],
offsets: Offset*): OffsetSeq
fill simply creates an OffsetSeq for the given variable sequence of Offsets and the
optional metadata.
CompactibleFileStreamLog Contract — Compactible Metadata Logs
CompactibleFileStreamLog is the extension of the HDFSMetadataLog contract for metadata
logs that can compact logs at regular intervals.
compactLogs
Used when CompactibleFileStreamLog is requested to
compact and allFiles
defaultCompactInterval: Int
defaultCompactInterval
Default compaction interval
Used exclusively when CompactibleFileStreamLog is
requested for the compactInterval
fileCleanupDelayMs: Long
fileCleanupDelayMs
Used exclusively when CompactibleFileStreamLog is
requested to deleteExpiredLog
isDeletingExpiredLog: Boolean
isDeletingExpiredLog
Used exclusively when CompactibleFileStreamLog is
requested to store (add) metadata of a streaming batch
Table 2. CompactibleFileStreamLogs
CompactibleFileStreamLog Description
FileStreamSinkLog
Metadata version
SparkSession
batchIdToPath Method
batchIdToPath …FIXME
pathToBatchId Method
pathToBatchId …FIXME
isBatchFile Method
isBatchFile …FIXME
serialize(
logData: Array[T],
out: OutputStream): Unit
serialize firstly writes the version header ( v and the metadataLogVersion) out to the
given output stream.
serialize then writes the log data (serialized using Json4s (with Jackson binding) library).
deserialize …FIXME
add(
batchId: Long,
logs: Array[T]): Boolean
Note add is part of the HDFSMetadataLog Contract to store metadata for a batch.
add …FIXME
allFiles Method
allFiles(): Array[T]
allFiles …FIXME
Note
allFiles is used when FileStreamSource and MetadataLogFileIndex are created.
compact(
batchId: Long,
logs: Array[T]): Boolean
compact interval).
compact …FIXME
In the end, compact compacts the logs ( compactLogs ) and requests the parent
HDFSMetadataLog to persist the metadata of a streaming batch (to a metadata log file).
getValidBatchesBeforeCompactionBatch Object
Method
getValidBatchesBeforeCompactionBatch(
compactionBatchId: Long,
compactInterval: Int): Seq[Long]
getValidBatchesBeforeCompactionBatch …FIXME
isCompactionBatch …FIXME
getBatchIdFromFileName simply removes the .compact suffix from the given fileName and
converts the remaining part to a batch ID (a number).
deleteExpiredLog(
currentBatchId: Long): Unit
deleteExpiredLog does nothing and simply returns when the current batch ID incremented
deleteExpiredLog …FIXME
Internal Properties
Name Description
compactInterval Compact interval
FileStreamSourceLog
FileStreamSourceLog is a concrete CompactibleFileStreamLog (of FileEntry metadata) of
FileStreamSource.
FileStreamSourceLog (as a CompactibleFileStreamLog) takes the following to be created:
Metadata version
SparkSession
add(
batchId: Long,
logs: Array[FileEntry]): Boolean
If so (and this is a compaction batch), add adds the batch and the logs to the fileEntryCache
internal registry (and possibly removing the eldest entry if the size is above the cacheSize).
get Method
get(
startId: Option[Long],
endId: Option[Long]): Array[(Long, Array[FileEntry])]
get …FIXME
Internal Properties
Name Description
cacheSize Size of the fileEntryCache that is exactly the compact interval
Used when the fileEntryCache is requested to add a new entry in add and get a
compaction batch
OffsetSeqMetadata — Metadata of Streaming Batch
OffsetSeqMetadata holds the metadata for the current streaming batch: the event-time
watermark ( batchWatermarkMs ), the batch timestamp ( batchTimestampMs ), and the relevant
SQL configuration properties:
SHUFFLE_PARTITIONS
STATE_STORE_PROVIDER_CLASS
STREAMING_MULTIPLE_WATERMARK_POLICY
FLATMAPGROUPSWITHSTATE_STATE_FORMAT_VERSION
STREAMING_AGGREGATION_STATE_FORMAT_VERSION
setSessionConf.
apply(
batchWatermarkMs: Long,
batchTimestampMs: Long,
sessionConf: RuntimeConfig): OffsetSeqMetadata
apply …FIXME
setSessionConf Method
setSessionConf …FIXME
CheckpointFileManager Contract
CheckpointFileManager is the abstraction of checkpoint managers that manage checkpoint
files (e.g. of HDFSMetadataLog, StreamMetadata and HDFSBackedStateStoreProvider).
createAtomic(
path: Path,
overwriteIfPossible: Boolean): CancellableFSDataOutpu
tStream
Used when:
HDFSMetadataLog is requested to store metadata for a
createAtomic batch (that writeBatchToFile)
StreamMetadata helper object is requested to persist
metadata
HDFSBackedStateStore is requested for the
deltaFileStream
HDFSBackedStateStoreProvider is requested to
writeSnapshotFile
Used when:
RenameBasedFSDataOutputStream is requested to cancel
delete
CompactibleFileStreamLog is requested to store
metadata for a batch (that deleteExpiredLog)
HDFSMetadataLog is requested to remove expired
metadata and purgeAfter
HDFSBackedStateStoreProvider is requested to do
maintenance (that cleans up)
exists
isLocal: Boolean
isLocal
list(
path: Path): Array[FileStatus] (1)
list(
path: Path,
filter: PathFilter): Array[FileStatus]
HDFSMetadataLog is created
Table 2. CheckpointFileManagers
CheckpointFileManager Description
FileContextBasedCheckpointFileManager Default CheckpointFileManager that uses Hadoop’s
FileContext API for managing checkpoint files (unless the
spark.sql.streaming.checkpointFileManagerClass configuration property is used)
create(
path: Path,
hadoopConf: Configuration): CheckpointFileManager
create creates a CheckpointFileManager for the given path and hadoopConf configuration:
the class defined by the spark.sql.streaming.checkpointFileManagerClass configuration
property if set, or a FileContextBasedCheckpointFileManager, falling back to a
FileSystemBasedCheckpointFileManager (with the following WARN message) when the
FileContext API is not supported for the path.
Could not use FileContext API for managing Structured Streaming checkpoint files at
[path]. Using FileSystem API instead for managing log files. If the implementation of
FileSystem.rename() is not atomic, then the correctness and fault-tolerance of your
Structured Streaming is not guaranteed.
HDFSMetadataLog is created
FileContextBasedCheckpointFileManager
FileContextBasedCheckpointFileManager is…FIXME
FileSystemBasedCheckpointFileManager — CheckpointFileManager on Hadoop’s FileSystem API
FileSystemBasedCheckpointFileManager is a CheckpointFileManager that uses Hadoop’s
FileSystem API for managing checkpoint files (with atomic writes by "write to
temp-file-and-rename").
Internal Properties
Name Description
fs Hadoop’s FileSystem of the checkpoint directory
Offset
Offset is the abstraction of stream positions that can be serialized to JSON format.
String json()
Table 2. Offsets
Offset Description
ContinuousMemoryStreamOffset
FileStreamSourceOffset
KafkaSourceOffset
LongOffset
RateStreamOffset
TextSocketOffset
StreamProgress
get simply looks up an Offset for the given BaseStreamingSource in the baseMap.
++ Method
++(
updates: GenTraversableOnce[(BaseStreamingSource, Offset)]): StreamProgress
++ simply creates a new StreamProgress with the baseMap and the given updates.
toOffsetSeq(
sources: Seq[BaseStreamingSource],
metadata: OffsetSeqMetadata): OffsetSeq
toOffsetSeq creates a OffsetSeq with offsets that are looked up for every
BaseStreamingSource.
Micro-Batch Stream Processing
import org.apache.spark.sql.streaming.Trigger
import scala.concurrent.duration._
val sq = spark
.readStream
.format("rate")
.load
.writeStream
.format("console")
.option("truncate", false)
.trigger(Trigger.ProcessingTime(1.minute)) // <-- Uses MicroBatchExecution for execu
tion
.queryName("rate2console")
.start
assert(sq.isActive)
scala> sq.explain
== Physical Plan ==
WriteToDataSourceV2 org.apache.spark.sql.execution.streaming.sources.MicroBatchWriter@
678e6267
+- *(1) Project [timestamp#54, value#55L]
+- *(1) ScanV2 rate[timestamp#54, value#55L]
// sq.stop
1. triggerExecution
3. getEndOffset
4. walCommit
5. getBatch
6. queryPlanning
7. addBatch
Execution phases with execution times are available using StreamingQueryProgress under
durationMs .
scala> :type sq
org.apache.spark.sql.streaming.StreamingQuery
sq.lastProgress.durationMs.get("walCommit")
MicroBatchExecution — Stream Execution Engine of Micro-Batch Stream Processing
MicroBatchExecution is the stream execution engine in Micro-Batch Stream Processing.
import org.apache.spark.sql.streaming.Trigger
val query = spark
.readStream
.format("rate")
.load
.writeStream
.format("console") // <-- not a StreamWriteSupport sink
.option("truncate", false)
.trigger(Trigger.Once) // <-- Gives MicroBatchExecution
.queryName("rate2console")
.start
// Access the underlying MicroBatchExecution (through StreamingQueryWrapper)
import org.apache.spark.sql.execution.streaming.StreamingQueryWrapper
val engine = query.asInstanceOf[StreamingQueryWrapper].streamingQuery

import org.apache.spark.sql.execution.streaming.MicroBatchExecution
val microBatchEngine = engine.asInstanceOf[MicroBatchExecution]
assert(microBatchEngine.trigger == Trigger.Once)
Refer to Logging.
SparkSession
Streaming sink
Trigger
Output mode
triggerExecutor: TriggerExecutor
triggerExecutor is initialized based on the given Trigger (that was used to create the
MicroBatchExecution ):
triggerExecutor throws an IllegalStateException when the Trigger is not one of the built-
in implementations.
runActivatedStream(
sparkSessionForStream: SparkSession): Unit
runActivatedStream requests the TriggerExecutor to execute micro-batches using the batch
runner (until MicroBatchExecution is terminated due to a query stop or a failure).
Note trigger and batch are considered equivalent and used interchangeably.
The batch runner initializes query progress for the new trigger (aka startTrigger).
The batch runner starts triggerExecution execution phase that is made up of the following
steps:
1. Populating start offsets from checkpoint before the first "zero" batch (at every start or
restart)
At the start or restart (resume) of a streaming query (when the current batch ID is
uninitialized and -1 ), the batch runner populates start offsets from checkpoint and then
prints out the following INFO message to the logs (using the committedOffsets internal
registry):
The batch runner sets the human-readable description for any Spark job submitted (that
streaming sources may submit to get new data) as the batch description.
The batch runner constructs the next streaming micro-batch (when the
isCurrentBatchConstructed internal flag is off).
The batch runner records trigger offsets (with the committed and available offsets).
The batch runner updates the current StreamingQueryStatus with the isNewDataAvailable
for isDataAvailable property.
With the isCurrentBatchConstructed flag enabled ( true ), the batch runner updates the
status message to one of the following (per isNewDataAvailable) and runs the streaming
micro-batch.
With the isCurrentBatchConstructed flag disabled ( false ), the batch runner simply updates
the status message to the following:
The batch runner finalizes query progress for the trigger (with a flag that indicates whether
the current batch had new data).
With the isCurrentBatchConstructed flag enabled ( true ), the batch runner increments the
currentBatchId and turns the isCurrentBatchConstructed flag off ( false ).
With the isCurrentBatchConstructed flag disabled ( false ), the batch runner simply sleeps
(as long as configured using the spark.sql.streaming.pollingDelay configuration property).
In the end, the batch runner updates the status message to the following status and returns
whether the MicroBatchExecution is active or not.
populateStartOffsets(
sparkSessionToRunBatches: SparkSession): Unit
populateStartOffsets requests the Offset Write-Ahead Log for the latest committed batch id
Note
The batch id could not be available in the write-ahead log when a streaming query
started with a new log or no batch was persisted (added) to the log before.
populateStartOffsets branches off based on whether the latest committed batch was
available or not.
When the latest batch ID found is greater than 0 , populateStartOffsets requests the
Offset Write-Ahead Log for the second latest batch ID with metadata or throws an
IllegalStateException if not found.
populateStartOffsets sets the committed offsets to the second latest committed offsets.
populateStartOffsets requests the Offset Commit Log for the latest committed batch id with
When the latest committed batch id with metadata was found which is exactly the latest
batch ID (found in the Offset Commit Log), populateStartOffsets …FIXME
When the latest committed batch id with metadata was found, but it is not exactly the second
latest batch ID (found in the Offset Commit Log), populateStartOffsets prints out the
following WARN message to the logs:
When no commit log present in the Offset Commit Log, populateStartOffsets prints out the
following INFO message to the logs:
In the end, populateStartOffsets prints out the following DEBUG message to the logs:
WatermarkTracker.
constructNextBatch(
noDataBatchesEnabled: Boolean): Boolean
1. Requesting the latest offsets from every streaming source (of the streaming query)
3. Updating batch metadata with the current event-time watermark and batch timestamp
In the end, constructNextBatch returns whether the next streaming micro-batch was
constructed or skipped.
constructNextBatch checks out the latest offset in every streaming data source
Note
sequentially, i.e. one data source at a time.
In getOffset time-tracking section, constructNextBatch requests the Source for the latest
offset.
For every MicroBatchReader (Data Source API V2), constructNextBatch updates the status
message to the following:
offsets.
constructed ( lastExecutionRequiresAnotherBatch ).
another micro-batch (using the batch metadata) and the given noDataBatchesEnabled flag is
enabled ( true ).
constructNextBatch also checks out whether new data is available (based on available and
committed offsets).
noDataBatchesEnabled = [noDataBatchesEnabled],
lastExecutionRequiresAnotherBatch =
[lastExecutionRequiresAnotherBatch], isNewDataAvailable =
[isNewDataAvailable], shouldConstructNextBatch =
[shouldConstructNextBatch]
constructNextBatch branches off per whether to construct or skip the next batch (per the
shouldConstructNextBatch flag).
In case of a failure while adding the available offsets to the write-ahead log,
constructNextBatch throws an AssertionError :
Concurrent update to the log. Multiple streaming jobs detected for [currentBatchId]
runBatch(
sparkSessionToRunBatch: SparkSession): Unit
runBatch prints out the following DEBUG message to the logs (with the current batch ID):
1. getBatch Phase — Creating Logical Query Plans For Unprocessed Data From Sources
and MicroBatchReaders
2. Transforming Logical Plan to Include Sources and MicroBatchReaders with New Data
In the end, runBatch prints out the following DEBUG message to the logs (with the current
batch ID):
runBatch requests sources and readers for data per offset range sequentially,
Note
one by one.
Requests the committedOffsets for the committed offsets for the Source (if available)
Requests the Source for a dataframe for the offset range (the current and available
offsets)
In the end, runBatch returns the Source and the logical plan of the streaming dataset (for
the offset range).
In case the Source returns a dataframe that is not streaming, runBatch throws an
AssertionError :
Requests the committedOffsets for the committed offsets for the MicroBatchReader (if
available)
Requests the MicroBatchReader to set the offset range (the current and available
offsets)
runBatch looks up the DataSourceV2 and the options for the MicroBatchReader (in the
readerToDataSourceMap internal registry).
In the end, runBatch requests the MicroBatchReader for the read schema and creates a
StreamingDataSourceV2Relation logical operator (with the read schema, the DataSourceV2 ,
with new data ( newBatchesPlan with logical plans to process data that has arrived since the
last batch).
If the logical plan is found, runBatch makes the plan a child operator of Project (with
Aliases ) logical operator and replaces the StreamingExecutionRelation .
Otherwise, if not found, runBatch simply creates an empty streaming LocalRelation (for
scanning data from an empty local collection).
logical plan (with new data) with the current batch timestamp (based on the batch metadata).
for the new StreamWriteSupport sinks (per the type of the BaseStreamingSink).
and the extra options). runBatch then creates a WriteToDataSourceV2 logical operator with
a new MicroBatchWriter as a child operator (for the current batch ID and the StreamWriter).
__is_continuous_processing false
Output mode
Run id
Batch id
In the end (of the queryPlanning phase), runBatch requests the IncrementalExecution to
prepare the transformed logical plan for execution (i.e. execute the executedPlan query
execution phase).
The DataFrame represents the result of executing the current micro-batch of the streaming
query.
For a Sink (Data Source API V1), runBatch simply requests the Sink to add the
DataFrame (with the batch ID).
For a StreamWriteSupport (Data Source API V2), runBatch simply requests the DataFrame
with new data to collect (which simply forces execution of the MicroBatchWriter).
runBatch requests the Offset Commit Log to persist the metadata of the streaming micro-
batch (with the current batch ID and the event-time watermark of the WatermarkTracker).
In the end, runBatch adds the available offsets to the committed offsets (and updates the
offsets of every BaseStreamingSource with new data in the current micro-batch).
stop(): Unit
When the stream execution thread is alive, stop requests the current SparkContext to
cancelJobGroup identified by the runId and waits for this thread to die. Just to make sure
that there are no more streaming jobs, stop requests the current SparkContext to
cancelJobGroup identified by the runId again.
In the end, stop prints out the following INFO message to the logs:
isNewDataAvailable: Boolean
isNewDataAvailable checks whether there is a streaming source (in the available offsets) for
which committed offsets are different from the available offsets or not available (committed)
at all.
isNewDataAvailable is positive ( true ) when there is at least one such streaming source.
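The check can be sketched as follows (a paraphrase of the description above, not
necessarily the exact implementation; availableOffsets and committedOffsets are the
StreamProgress registries):
def isNewDataAvailable: Boolean =
  availableOffsets.exists { case (source, available) =>
    committedOffsets
      .get(source)
      .map(committed => committed != available)
      .getOrElse(true) // never committed => new data
  }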
logicalPlan: LogicalPlan
For every StreamingRelation logical operator, logicalPlan tries to replace it with the
StreamingExecutionRelation that was used earlier for the same StreamingRelation (if used
multiple times in the plan) or creates a new one. While creating a new
StreamingExecutionRelation , logicalPlan requests the DataSource to create a streaming
Source with the metadata path as sources/uniqueID directory in the checkpoint root
directory. logicalPlan prints out the following INFO message to the logs:
sources/uniqueID directory in the checkpoint root directory. logicalPlan prints out the
For every other StreamingRelationV2 logical operator, logicalPlan tries to replace it with
the StreamingExecutionRelation that was used earlier for the same StreamingRelationV2 (if
used multiple times in the plan) or creates a new one. While creating a new
StreamingExecutionRelation , logicalPlan requests the StreamingRelation for the
underlying DataSource that is in turn requested to create a streaming Source with the
metadata path as sources/uniqueID directory in the checkpoint root directory. logicalPlan
prints out the following INFO message to the logs:
In the end, logicalPlan sets the uniqueSources internal registry to be the unique
BaseStreamingSources above.
logicalPlan throws an AssertionError when not executed on the stream execution thread.
logicalPlan must be initialized in QueryExecutionThread but the current thread was [cu
rrentThread]
the current batch or epoch IDs (that Spark tasks can use)
Internal Properties
Name Description
isCurrentBatchConstructed Flag to control whether to run a streaming micro-batch
( true ) or not ( false )
Default: false
readerToDataSourceMap
( Map[MicroBatchReader, (DataSourceV2, Map[String, String])] )
Default: (empty)
MicroBatchWriter
MicroBatchWriter is a DataSourceWriter that uses a given batch ID as the epoch when
requested to commit, abort and create a WriterFactory for a given StreamWriter in
Micro-Batch Stream Processing.
MicroBatchReadSupport Contract — Data Sources with MicroBatchReaders
MicroBatchReadSupport is the extension of the DataSourceV2 for data sources with a
MicroBatchReader.
MicroBatchReader createMicroBatchReader(
Optional<StructType> schema,
String checkpointLocation,
DataSourceOptions options)
Table 1. MicroBatchReadSupports
MicroBatchReadSupport Description
KafkaSourceProvider Data source provider for kafka format
MicroBatchReader
Tip Read up on Data Source API V2 in The Internals of Spark SQL book.
Used when…FIXME
deserializeOffset
Deserializes offset (from JSON format)
Used when…FIXME
Offset getEndOffset()
getEndOffset
End offset of this reader
Used when…FIXME
Offset getStartOffset()
getStartOffset
Start (beginning) offsets of this reader
Used when…FIXME
void setOffsetRange(
Optional<Offset> start,
Optional<Offset> end)
setOffsetRange
Sets the desired offset range for input partitions created from this
reader (for data scan)
Used when…FIXME
Table 2. MicroBatchReaders
MicroBatchReader Description
KafkaMicroBatchReader
MemoryStream
RateStreamMicroBatchReader
TextSocketMicroBatchReader
WatermarkTracker
WatermarkTracker tracks the event-time watermark of a streaming query (across
EventTimeWatermarkExec operators in a physical query plan).
WatermarkTracker is created when MicroBatchExecution is requested to populate start
offsets (when requested to run an activated streaming query).
Refer to Logging.
setWatermark Method
setWatermark simply updates the global event-time watermark to the given newWatermarkMs .
updateWatermark …FIXME
Internal Properties
Name Description
globalWatermarkMs Current global event-time watermark per
MultipleWatermarkPolicy (across all EventTimeWatermarkExec operators in a physical
query plan)
Default: 0
Used when…FIXME
Used when…FIXME
Source
Source is part of Data Source API V1 and used in Micro-Batch Stream Processing only.
For fault tolerance, Source must be able to replay an arbitrary sequence of past data in a
stream using a range of offsets. This is the assumption so Structured Streaming can achieve
end-to-end exactly-once guarantees.
getBatch(
start: Option[Offset],
end: Offset): DataFrame
getOffset: Option[Offset]
schema: StructType
Table 2. Sources
Source Description
FileStreamSource Part of file-based data sources ( FileFormat )
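As a sketch, the contract looks as follows (an approximation, not necessarily the exact
definition; the commit and stop lifecycle methods are included for completeness):
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.streaming.Offset
import org.apache.spark.sql.types.StructType

trait Source {
  // Schema of the data from this source
  def schema: StructType
  // The highest offset available, if any data has been received
  def getOffset: Option[Offset]
  // The data between the two offsets (start is None for the very first batch)
  def getBatch(start: Option[Offset], end: Offset): DataFrame
  // Informs the source that data up to the given offset has been processed
  def commit(end: Offset): Unit = {}
  // Stops the source and frees its resources
  def stop(): Unit
}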
StreamSourceProvider Contract — Streaming Source Providers for Micro-Batch Stream
Processing (Data Source API V1)
StreamSourceProvider is the contract of data source providers that can create a streaming
source for a format (e.g. text file) or system (e.g. Apache Kafka).
StreamSourceProvider is part of Data Source API V1 and used in Micro-Batch Stream
Processing only.
createSource(
sqlContext: SQLContext,
metadataPath: String,
schema: Option[StructType],
providerName: String,
parameters: Map[String, String]): Source
sourceSchema(
sqlContext: SQLContext,
schema: Option[StructType],
providerName: String,
parameters: Map[String, String]): (String, StructType)
sourceSchema
Sink
Sink is the abstraction of streaming sinks that can add batches to an output.
Sink is part of Data Source API V1 and used in Micro-Batch Stream Processing only.
addBatch(
batchId: Long,
data: DataFrame): Unit
Table 2. Sinks
Sink Description
FileStreamSink Used in file-based data sources ( FileFormat )
StreamSinkProvider Contract
StreamSinkProvider is the abstraction of providers that can create a streaming sink for a
file format or system.
createSink(
sqlContext: SQLContext,
parameters: Map[String, String],
partitionColumns: Seq[String],
outputMode: OutputMode): Sink
createSink
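A minimal sketch of a custom provider (the DemoSink and DemoSinkProvider names are made
up; the provider would be referenced by its fully-qualified class name in
DataStreamWriter.format):
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.execution.streaming.Sink
import org.apache.spark.sql.sources.StreamSinkProvider
import org.apache.spark.sql.streaming.OutputMode

class DemoSink extends Sink {
  override def addBatch(batchId: Long, data: DataFrame): Unit = {
    // Just report the batch; a real sink would write the rows out
    println(s"DemoSink.addBatch(batchId=$batchId, schema=${data.schema.simpleString})")
  }
}

class DemoSinkProvider extends StreamSinkProvider {
  override def createSink(
      sqlContext: SQLContext,
      parameters: Map[String, String],
      partitionColumns: Seq[String],
      outputMode: OutputMode): Sink = new DemoSink
}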
Continuous Stream Processing
Continuous Stream Processing execution engine uses the novel Data Source API V2
(Spark SQL) and for the very first time makes stream processing truly continuous.
Tip Read up on Data Source API V2 in The Internals of Spark SQL book.
Because of the two innovative changes Continuous Stream Processing is often referred to
as Structured Streaming V2.
import org.apache.spark.sql.streaming.Trigger
import scala.concurrent.duration._
val sq = spark
.readStream
.format("rate")
.load
.writeStream
.format("console")
.option("truncate", false)
.trigger(Trigger.Continuous(15.seconds)) // <-- Uses ContinuousExecution for executi
on
.queryName("rate2console")
.start
scala> :type sq
org.apache.spark.sql.streaming.StreamingQuery
assert(sq.isActive)
// sq.stop
scala> :type sq
org.apache.spark.sql.streaming.StreamingQuery
scala> sq.explain
== Physical Plan ==
WriteToContinuousDataSource ConsoleWriter[numRows=20, truncate=false]
+- *(1) Project [timestamp#758, value#759L]
+- *(1) ScanV2 rate[timestamp#758, value#759L]
That collect operator is how a Spark job is run (as tasks over all partitions of the RDD) as
described by the ContinuousWriteRDD.compute "protocol" (a recipe for the tasks to be
scheduled to run on Spark executors).
ContinuousExecution — Stream Execution Engine of Continuous Stream Processing
ContinuousExecution is the stream execution engine of Continuous Stream Processing.
ContinuousExecution supports exactly one ContinuousReader in a streaming query (and
asserts it when addOffset and committing an epoch). When requested for available
streaming sources, ContinuousExecution simply gives the single ContinuousReader.
import org.apache.spark.sql.streaming.Trigger
import scala.concurrent.duration._
val sq = spark
.readStream
.format("rate")
.load
.writeStream
.format("console")
.option("truncate", false)
.trigger(Trigger.Continuous(1.minute)) // <-- Gives ContinuousExecution
.queryName("rate2console")
.start
import org.apache.spark.sql.streaming.StreamingQuery
assert(sq.isInstanceOf[StreamingQuery])
// Access the underlying ContinuousExecution (through StreamingQueryWrapper)
import org.apache.spark.sql.execution.streaming.StreamingQueryWrapper
val engine = sq.asInstanceOf[StreamingQueryWrapper].streamingQuery

import org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution
val continuousEngine = engine.asInstanceOf[ContinuousExecution]
assert(continuousEngine.trigger == Trigger.Continuous(1.minute))
When created (for a streaming query), ContinuousExecution is given the analyzed logical
plan. The analyzed logical plan is immediately transformed to include a
ContinuousExecutionRelation for every StreamingRelationV2 with ContinuousReadSupport
data source (and is the logical plan internally).
Refer to Logging.
runActivatedStream simply runs the streaming query in continuous mode as long as the
state is ACTIVE.
runContinuous transforms the analyzed logical plan to find ContinuousExecutionRelation
leaf logical operators and requests their ContinuousReadSupport data sources to create a
ContinuousReader (with the sources metadata directory under the checkpoint directory).
distinct.
runContinuous gets the start offsets (they may or may not be available).
ContinuousReader itself).
__is_continuous_processing as true
Note
The EpochCoordinator RPC endpoint runs on the driver as the single point to coordinate
epochs across partition tasks.
getStartOffsets …FIXME
In essence, commit adds the given epoch to commit log and the committedOffsets, and
requests the ContinuousReader to commit the corresponding offset. In the end, commit
removes old log entries from the offset and commit logs (to keep
spark.sql.streaming.minBatchesToRetain entries only).
At this point, commit may simply return when the stream execution thread is no longer alive
(died).
commit requests the commit log to store a metadata for the epoch.
commit requests the single ContinuousReader to deserialize the offset for the epoch (from
commit adds the single ContinuousReader and the offset (for the epoch) to the
committedOffsets registry.
commit requests the offset and commit logs to remove log entries to keep
spark.sql.streaming.minBatchesToRetain only.
commit then acquires the awaitProgressLock, wakes up all threads waiting for the
commit asserts that the given epoch is available in the offsetLog internal registry (i.e. the
addOffset Method
addOffset(
epoch: Long,
reader: ContinuousReader,
partitionOffsets: Seq[PartitionOffset]): Unit
Figure 1. ContinuousExecution.addOffset
Internally, addOffset requests the given ContinuousReader to mergeOffsets (with the given
PartitionOffsets ) and to get the current "global" offset back.
addOffset then requests the OffsetSeqLog to add the current "global" offset for the given
epoch .
addOffset requests the OffsetSeqLog for the offset at the previous epoch.
If the offsets at the current and previous epochs are the same, addOffset turns the
noNewData internal flag on.
addOffset then acquires the awaitProgressLock, wakes up all threads waiting for the
logicalPlan: LogicalPlan
SparkSession
StreamWriteSupport
Trigger
Clock
Output mode
stop(): Unit
If the queryExecutionThread is alive (i.e. it has been started and has not yet died), stop
interrupts it and waits for this thread to die.
In the end, stop prints out the following INFO message to the logs:
awaitEpoch …FIXME
Internal Properties
Name Description
continuousSources: Seq[ContinuousReader]
FIXME
currentEpochCoordinatorId
Used when…FIXME
triggerExecutor
Used when…FIXME
Note
StreamExecution throws an IllegalStateException when the Trigger is not a
ContinuousTrigger.
ContinuousReadSupport Contract — Data Sources with ContinuousReaders
ContinuousReadSupport is the extension of the DataSourceV2 for data sources with a
ContinuousReader.
ContinuousReader createContinuousReader(
Optional<StructType> schema,
String checkpointLocation,
DataSourceOptions options)
Table 1. ContinuousReadSupports
ContinuousReadSupport Description
ContinuousMemoryStream Data source provider for memory format
ContinuousReader Contract
Tip Read up on Data Source API V2 in The Internals of Spark SQL book.
commit
Commits the specified offset
Used exclusively when ContinuousExecution is requested to
commit an epoch
deserializeOffset
Deserializes an offset from JSON representation
Used when ContinuousExecution is requested to run a
streaming query and commit an epoch
Offset getStartOffset()
getStartOffset
mergeOffsets
Used exclusively when ContinuousExecution is requested to
addOffset
boolean needsReconfiguration()
setStartOffset
Used exclusively when ContinuousExecution is requested to
run the streaming query in continuous mode.
Table 2. ContinuousReaders
ContinuousReader Description
ContinuousMemoryStream
KafkaContinuousReader
RateStreamContinuousReader
TextSocketContinuousReader
523
RateStreamContinuousReader
RateStreamContinuousReader
RateStreamContinuousReader is a ContinuousReader that…FIXME
EpochCoordinator RPC Endpoint
ReportPartitionOffset
Partition ID
Epoch
PartitionOffset
Sent out (in one-way asynchronous mode) exclusively when ContinuousQueuedDataReader is
requested for the next row to be read in the current epoch, and the epoch is done
Refer to Logging.
CommitPartitionEpoch
ReportPartitionOffset
With the queryWritesStopped turned on, receive simply swallows messages and does
nothing.
GetCurrentEpoch
IncrementAndGetEpoch
SetReaderPartitions
527
EpochCoordinator RPC Endpoint
SetWriterPartitions
StopContinuousExecutionWrites
resolveCommitsAtEpoch …FIXME
commitEpoch(
epoch: Long,
messages: Iterable[WriterCommitMessage]): Unit
commitEpoch …FIXME
StreamWriter
ContinuousReader
ContinuousExecution
Start epoch
SparkSession
RpcEnv
create(
writer: StreamWriter,
reader: ContinuousReader,
query: ContinuousExecution,
epochCoordinatorId: String,
startEpoch: Long,
session: SparkSession,
env: SparkEnv): RpcEndpointRef
create simply creates a new EpochCoordinator and requests the RpcEnv to register it as an RPC endpoint.
Internal Properties
Name Description
EpochCoordinatorRef
EpochCoordinatorRef is…FIXME
create(
writer: StreamWriter,
reader: ContinuousReader,
query: ContinuousExecution,
epochCoordinatorId: String,
startEpoch: Long,
session: SparkSession,
env: SparkEnv): RpcEndpointRef
create …FIXME
get …FIXME
EpochTracker
EpochTracker is…FIXME
getCurrentEpoch: Option[Long]
getCurrentEpoch …FIXME
incrementCurrentEpoch(): Unit
incrementCurrentEpoch …FIXME
ContinuousQueuedDataReader
ContinuousQueuedDataReader is created exclusively when ContinuousDataSourceRDD is requested to compute a partition.
EpochMarker
next(): InternalRow
next …FIXME
close(): Unit
Note: close is part of the java.io.Closeable contract to close this stream and release any system resources associated with it.
close …FIXME
ContinuousDataSourceRDDPartition
TaskContext
epochPollIntervalMs
Internal Properties
Name Description
currentOffset PartitionOffset. Used when…FIXME
epochMarkerExecutor java.util.concurrent.ScheduledExecutorService. Used when…FIXME
epochMarkerGenerator EpochMarkerGenerator. Used when…FIXME
reader InputPartitionReader. Used when…FIXME
queue java.util.concurrent.ArrayBlockingQueue of ContinuousRecords (of the given data size). Used when…FIXME
DataReaderThread
DataReaderThread is…FIXME
EpochMarkerGenerator Thread
EpochMarkerGenerator is…FIXME
run Method
run(): Unit
run …FIXME
PartitionOffset
PartitionOffset is…FIXME
ContinuousExecutionRelation Leaf Logical Operator
Tip Read up on Leaf Logical Operators in The Internals of Spark SQL book.
ContinuousReadSupport source
SparkSession
WriteToContinuousDataSource Unary Logical Operator
StreamWriter
WriteToContinuousDataSourceExec Unary Physical Operator
WriteToContinuousDataSourceExec is a unary physical operator that creates a
StreamWriter
Refer to Logging.
doExecute(): RDD[InternalRow]
doExecute then requests the child physical operator to execute (that gives an RDD[InternalRow]) and uses it to create a ContinuousWriteRDD.
Start processing data source writer: [writer]. The input RDD has [partitions] partitions.
Note: The EpochCoordinator RPC endpoint runs on the driver as the single point to coordinate epochs across partition tasks.
In the end, doExecute requests the ContinuousWriteRDD to collect (which simply runs a
Spark job on all partitions in an RDD and returns the results in an array).
ContinuousWriteRDD — RDD of WriteToContinuousDataSourceExec Unary Physical Operator
ContinuousWriteRDD is a specialized RDD ( RDD[Unit] ) that is used exclusively as the RDD of the WriteToContinuousDataSourceExec unary physical operator.
ContinuousWriteRDD uses the parent RDD for the partitions and the partitioner.
compute(
split: Partition,
context: TaskContext): Iterator[Unit]
Note: The EpochCoordinator RPC endpoint runs on the driver as the single point to coordinate epochs across partition tasks.
compute then executes the following steps (in a loop) until the task (as the given TaskContext) is marked completed or interrupted.
compute requests the parent RDD to compute the given partition (that gives an
Iterator[InternalRow] ).
compute requests the DataWriterFactory to create a DataWriter (for the partition and the
task attempt IDs from the given TaskContext and the current epoch from the EpochTracker
helper) and requests it to write all records (from the Iterator[InternalRow] ).
In the end (of the loop), compute uses the EpochTracker helper to incrementCurrentEpoch.
In case of an error, compute prints out the following ERROR message to the logs and
requests the DataWriter to abort.
In the end, compute prints out the following ERROR message to the logs:
ContinuousDataSourceRDD — Input RDD of DataSourceV2ScanExec Physical Operator with ContinuousReader
ContinuousDataSourceRDD is a specialized RDD ( RDD[InternalRow] ) that is used exclusively
for the only input RDD (with the input rows) of DataSourceV2ScanExec leaf physical operator
with a ContinuousReader.
ContinuousDataSourceRDD is created exclusively when the DataSourceV2ScanExec physical operator is requested for the input RDDs (of which there is only one, actually).
SparkContext
epochPollIntervalMs
InputPartition[InternalRow] s
preferred host locations (where the input partition reader can run faster).
compute(
split: Partition,
context: TaskContext): Iterator[InternalRow]
compute …FIXME
getPartitions Method
getPartitions: Array[Partition]
Note getPartitions is part of the RDD Contract to specify the partitions to compute.
getPartitions …FIXME
StreamExecution
StreamExecution is the base of stream execution engines (aka streaming query engines) that can run a structured query (on a stream execution thread).
logicalPlan: LogicalPlan
runActivatedStream(
sparkSessionForStream: SparkSession): Unit
// sq is an already-started streaming query (StreamingQuery)
import org.apache.spark.sql.streaming.StreamingQuery
assert(sq.isInstanceOf[StreamingQuery])
import org.apache.spark.sql.execution.streaming.StreamingQueryWrapper
val se = sq.asInstanceOf[StreamingQueryWrapper].streamingQuery
scala> :type se
org.apache.spark.sql.execution.streaming.StreamExecution
to allow the StreamExecutions to discard old log entries (from the offset and commit logs).
Table 2. StreamExecutions
StreamExecution Description
ContinuousExecution Used in Continuous Stream Processing
Dataset) that is executed every trigger and in the end adds the results to a sink.
import org.apache.spark.sql.streaming.Trigger
import scala.concurrent.duration._
val q = spark.
readStream.
format("rate").
load.
writeStream.
format("console").
trigger(Trigger.ProcessingTime(10.minutes)).
start
scala> :type q
org.apache.spark.sql.streaming.StreamingQuery
When started, StreamExecution starts a stream execution thread that simply runs stream
processing (and hence the streaming query).
import org.apache.spark.sql.streaming.Trigger
import scala.concurrent.duration._
val sq = spark
.readStream
.text("server-logs")
.writeStream
.format("console")
.queryName("debug")
.trigger(Trigger.ProcessingTime(20.seconds))
.start
// Enable the log level to see the INFO and DEBUG messages
// log4j.logger.org.apache.spark.sql.execution.streaming.StreamExecution=DEBUG
scala> :type q
org.apache.spark.sql.streaming.StreamingQuery
scala> println(q.lastProgress)
{
"id" : "03fc78fc-fe19-408c-a1ae-812d0e28fcee",
"runId" : "8c247071-afba-40e5-aad2-0e6f45f22488",
"name" : null,
"timestamp" : "2017-08-14T20:30:00.004Z",
"batchId" : 1,
"numInputRows" : 432,
"inputRowsPerSecond" : 0.9993568953312452,
"processedRowsPerSecond" : 1380.1916932907347,
"durationMs" : {
"addBatch" : 237,
"getBatch" : 26,
"getOffset" : 0,
"queryPlanning" : 1,
"triggerExecution" : 313,
"walCommit" : 45
},
"stateOperators" : [ ],
"sources" : [ {
"description" : "RateSource[rowsPerSecond=1, rampUpTimeSeconds=0, numPartitions=8]"
,
"startOffset" : 0,
"endOffset" : 432,
"numInputRows" : 432,
"inputRowsPerSecond" : 0.9993568953312452,
"processedRowsPerSecond" : 1380.1916932907347
} ],
"sink" : {
"description" : "ConsoleSink[numRows=20, truncate=true]"
}
}
StreamExecution uses an offset write-ahead log and a commit log to record the offsets to be processed and the offsets that have already been processed and committed to a streaming sink, respectively.
StreamExecution delays polling for new data for 10 milliseconds (when no data was available to process in a batch).
Note: Since the StreamMetadata is persisted (to the metadata file in the checkpoint directory), the streaming query ID "survives" query restarts as long as the checkpoint directory is preserved.

Note: runId does not "survive" query restarts and will always be different yet unique (across all active queries).

Note: The name, id and runId are all unique across all active queries (in a StreamingQueryManager). The difference is that:
name is optional and user-defined
id is a UUID that is auto-generated at the time StreamExecution is created and persisted to the metadata checkpoint file
runId is a UUID that is auto-generated every time StreamExecution is created

The ID of a streaming query is stored in the metadata file in the checkpoint directory. If the metadata file is available it is read and is the way to recover the ID of a streaming query when resumed (i.e. restarted after a failure or a planned stop).
Refer to Logging.
SparkSession
Streaming sink
Trigger
Clock
Output mode
offsetLog: OffsetSeqLog

offsetLog is an OffsetSeqLog with the offsets metadata directory (under the checkpoint root directory).

offsetLog is used as the Write-Ahead Log of Offsets to persist the offsets of the data about to be processed in every trigger.
Note: Metadata log or metadata checkpoint are synonyms and are often used interchangeably.
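Since the offset log is just files under the checkpoint directory, it can be inspected with plain file APIs. The sketch below is an illustration only (the /tmp/checkpoint path is an assumed example of a local checkpoint root) and prints the content of every entry in the offsets directory:

import java.io.File
import scala.io.Source

// Assumes a streaming query was started with checkpointLocation /tmp/checkpoint
val offsetsDir = new File("/tmp/checkpoint/offsets")
offsetsDir.listFiles.filter(_.isFile).sortBy(_.getName).foreach { f =>
  println(s"=== offset log entry ${f.getName} ===")
  Source.fromFile(f).getLines.foreach(println)
}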
state: AtomicReference[State]
state indicates the internal state of execution of the streaming query (as
java.util.concurrent.atomic.AtomicReference).
Table 3. States
Name Description
ACTIVE StreamExecution has been requested to run stream processing (and is about to run the activated streaming query)
availableOffsets: StreamProgress
availableOffsets is a collection of offsets per streaming source to track what data (by
offset) is available for processing for every streaming source in the streaming query (and
have not yet been committed).
committedOffsets: StreamProgress
committedOffsets is a collection of offsets per streaming source to track what data (by
offset) has already been processed and committed (to the sink or state stores) for every
streaming source in the streaming query.
resolvedCheckpointRoot: String
The given checkpoint root directory is defined using checkpointLocation option or the
spark.sql.streaming.checkpointLocation configuration property with queryName option.
resolvedCheckpointRoot is used when creating the path to the checkpoint directory and the commit log (the commits metadata directory) for streaming batches successfully executed (with a single file per batch with a file name being the batch id) or committed epochs.
Note: Metadata log or metadata checkpoint are synonyms and are often used interchangeably.
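Both ways of defining the checkpoint root directory can be shown with public APIs only; the paths and the query name below are arbitrary examples:

// 1. Explicit checkpointLocation option per query
val sq1 = spark.readStream.format("rate").load
  .writeStream
  .format("console")
  .option("checkpointLocation", "/tmp/checkpoints/my-query")
  .start

// 2. spark.sql.streaming.checkpointLocation (a root directory) plus queryName
//    (the query gets a subdirectory named after the query)
spark.conf.set("spark.sql.streaming.checkpointLocation", "/tmp/checkpoints")
val sq2 = spark.readStream.format("rate").load
  .writeStream
  .format("console")
  .queryName("rate-to-console")
  .start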
MicroBatchExecution is requested to populate the start offsets at the very beginning of the streaming query execution and later regularly every single batch
ContinuousExecution is requested to run the streaming query in continuous mode (that in turn requests to retrieve the start offsets at the very beginning of the streaming query execution and later regularly every commit)
lastExecution: IncrementalExecution
lastExecution is created when the stream execution engines are requested for the
following:
queryPlanning Phase)
and extractSourceToNumInputRows
explainInternal Method
explainInternal …FIXME
stopSources(): Unit
stopSources requests every streaming source (in the streaming query) to stop.
In case of a non-fatal exception, stopSources prints out the following WARN message to
the logs:
runStream(): Unit
runStream simply prepares the environment to execute the activated streaming query.
Internally, runStream sets the job group (to all the Spark jobs started by this thread) as
follows:
getBatchDescriptionString for the job group description (to display in web UI)
interruptOnCancel flag on
Note: runStream uses the SparkSession to access SparkContext and assign the job group id. Read up on the SparkContext.setJobGroup method in The Internals of Apache Spark book.
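The same SparkContext.setJobGroup call is available from user code; a minimal sketch (the group id and description below are made-up values):

// Group all Spark jobs started by this thread under one id, with a description
// shown in the web UI; interruptOnCancel makes cancelJobGroup interrupt the tasks.
spark.sparkContext.setJobGroup(
  "my-group-id",
  "demo: jobs started from this thread",
  interruptOnCancel = true)

spark.range(1000000).count()   // this job is attributed to "my-group-id"

// Later, possibly from another thread:
// spark.sparkContext.cancelJobGroup("my-group-id")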
runStream notifies StreamingQueryListeners that the streaming query has been started (by posting a QueryStartedEvent).
runStream unblocks the main starting thread (by decrementing the count of the startLatch).

Caution: FIXME A picture with two parallel lanes for the starting thread and a daemon one for the query.
Note: The analyzed logical plan is a lazy value in Scala and is initialized when requested the very first time.
runStream disables adaptive query execution and cost-based join optimization (by turning the spark.sql.adaptive.enabled and spark.sql.cbo.enabled configuration properties off).

runStream then executes the activated streaming query (which is different per StreamExecution, i.e. ContinuousExecution or MicroBatchExecution).
Note: runBatches does the main work only when first started (i.e. when state is INITIALIZING).
Once TriggerExecutor has finished executing batches, runBatches updates the status
message to Stopped.
runStream then stopSources (requests every streaming source in the streaming query to stop).
runStream removes the stream metrics reporter from the application's MetricsSystem.

runStream creates a new QueryTerminatedEvent (with the id and run id of the streaming query).
As long as the query is not stopped (i.e. state is not TERMINATED ), batchRunner executes
the streaming batch for the trigger.
1. populateStartOffsets
2. Setting Job Description as getBatchDescriptionString
3. Constructing the next streaming micro-batch

DEBUG Stream running from [committedOffsets] to [availableOffsets]
Note: You can check out the status of a streaming query using the status method.

scala> spark.streams.active(0).status
res1: org.apache.spark.sql.streaming.StreamingQueryStatus =
{
  "message" : "Waiting for next trigger",
  "isDataAvailable" : false,
  "isTriggerActive" : false
}
batchRunner then updates the status message to Processing new data and runs the current streaming batch.
When there was data available in the sources, batchRunner updates committed offsets (by
adding the current batch id to BatchCommitLog and adding availableOffsets to
committedOffsets).
batchRunner increments the current batch id and sets the job description for all the following Spark jobs.
When no data was available in the sources to process, batchRunner does the following:
batchRunner updates the status message to Waiting for next trigger and returns whether
the query is currently active or not (so TriggerExecutor can decide whether to finish
executing the batches or not)
start(): Unit
When called, start prints out the following INFO message to the logs:
start then starts the stream execution thread (as a daemon thread).
Note When started, a streaming query runs in its own execution thread on JVM.
In the end, start pauses the main thread (using the startLatch) until StreamExecution is requested to run the streaming query, which in turn sends a QueryStartedEvent to all streaming listeners followed by decrementing the count of the startLatch.
postEvent simply requests the StreamingQueryManager to post the input event (to the StreamingQueryListenerBus).

processAllAvailable(): Unit

processAllAvailable returns immediately when the streaming query is no longer active (i.e. is in TERMINATED state).
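processAllAvailable is part of the public StreamingQuery API; a minimal sketch (the query and table names are arbitrary examples):

// Block until all data available at the source so far has been processed
// and committed to the (in-memory) sink, then query the sink table.
val sq = spark.readStream.format("rate").load
  .writeStream
  .format("memory")
  .queryName("rate_snapshot")
  .start

sq.processAllAvailable()
spark.table("rate_snapshot").show(truncate = false)

// Eventually...
sq.stop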
processAllAvailable throws an IllegalStateException when executed on the stream execution thread:
Cannot wait for a query state from the same thread that is running the query
queryExecutionThread: QueryExecutionThread

queryExecutionThread is the thread of execution (QueryExecutionThread) that runs the streaming query.

queryExecutionThread is started (as a daemon thread) when StreamExecution is requested to start. At that time, start prints out the following INFO message to the logs (with the prettyIdString and the resolvedCheckpointRoot):
When started, queryExecutionThread sets the call site and runs the streaming query.
queryExecutionThread uses the name stream execution thread for [id] (that uses
prettyIdString for the id, i.e. queryName [id = [id], runId = [runId]] ).
QueryExecutionThread is a custom UninterruptibleThread (from Apache Spark) with the runUninterruptibly method for running a block of code without being interrupted by Thread.interrupt().
toDebugString …FIXME
offsetSeqMetadata: OffsetSeqMetadata
offsetSeqMetadata is a OffsetSeqMetadata.
offsetSeqMetadata is then updated (with the current event-time watermark and timestamp)
isActive Method
isActive: Boolean
exception Method
exception: Option[StreamingQueryException]
exception …FIXME
getBatchDescriptionString: String

getBatchDescriptionString is a human-readable description of the streaming query made up of the optional name if defined, the id, the runId and batchDescription that can be init (for the current batch ID negative) or the current batch ID itself.
noNewData: Boolean
noNewData is a flag that indicates that a batch has completed with no new data left to process.
Default: false
Internal Properties
Name Description
Java’s fair reentrant mutual exclusion
awaitProgressLock
java.util.concurrent.locks.ReentrantLock (that favors
granting access to the longest-waiting thread under
contention)
awaitProgressLockCondition Lock
callSite
Current batch ID
initializationLatch
pollingDelayMs
Set to spark.sql.streaming.pollingDelay Spark property.
prettyIdString
queryName [id = xyz, runId = abc]
startLatch
streamDeathCause StreamingQueryException
uniqueSources
Note: StreamingExecutionRelation is a leaf logical operator (i.e. LogicalPlan) that represents a streaming data source (and corresponds to a single StreamingRelation in the analyzed logical query plan of a streaming Dataset).
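The awaitProgressLock and awaitProgressLockCondition pair in the table above follows the standard fair-lock-plus-condition pattern from java.util.concurrent. The following standalone sketch (not StreamExecution's code) shows that pattern of waiting for progress and waking up all waiters:

import java.util.concurrent.locks.ReentrantLock

// Fair lock: under contention, the longest-waiting thread is granted access first
val progressLock = new ReentrantLock(true)
val progressCondition = progressLock.newCondition()

// Waiting side (e.g. a thread awaiting new progress)
def awaitProgress(): Unit = {
  progressLock.lock()
  try progressCondition.await()
  finally progressLock.unlock()
}

// Signalling side (e.g. after a batch or epoch completes)
def signalProgress(): Unit = {
  progressLock.lock()
  try progressCondition.signalAll()   // wake up all waiting threads
  finally progressLock.unlock()
}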
StreamingQueryWrapper — Serializable StreamExecution
StreamingQueryWrapper is a serializable interface of a StreamExecution.
import org.apache.spark.sql.execution.streaming.StreamingQueryWrapper
val query = spark
.readStream
.format("rate")
.load
.writeStream
.format("memory")
.queryName("rate2memory")
.start
assert(query.isInstanceOf[StreamingQueryWrapper])
StreamingQueryWrapper has the same StreamExecution API and simply passes all the calls (method invocations) to the underlying StreamExecution.
TriggerExecutor
TriggerExecutor is the interface for trigger executors that StreamExecution uses to execute a batch runner (for every trigger).
package org.apache.spark.sql.execution.streaming
trait TriggerExecutor {
def execute(batchRunner: () => Boolean): Unit
}
ProcessingTimeExecutor

ProcessingTimeExecutor(
  processingTime: ProcessingTime,
  clock: Clock = new SystemClock())
notifyBatchFallingBehind Method
Caution FIXME
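Given the trait above, a custom trigger executor is just a class with an execute method. The following hypothetical single-batch executor is a sketch only (assuming the org.apache.spark.sql.execution.streaming package is on the classpath), not an executor Spark ships with:

import org.apache.spark.sql.execution.streaming.TriggerExecutor

// Runs the batch runner exactly once and returns (similar in spirit to a one-time trigger)
class SingleBatchExecutor extends TriggerExecutor {
  override def execute(batchRunner: () => Boolean): Unit = {
    batchRunner()  // the batch runner reports whether the query is still active
  }
}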
IncrementalExecution — QueryExecution of Streaming Queries
IncrementalExecution is the QueryExecution of streaming queries.
IncrementalExecution is created when MicroBatchExecution is requested to run a streaming micro-batch (in the queryPlanning phase) and when ContinuousExecution is requested to run a streaming query (in the queryPlanning phase).
IncrementalExecution uses the statefulOperatorId internal counter for the IDs of the
stateful operators in the optimized logical plan (while applying the preparations rules) when
requested to prepare the plan for execution (in executedPlan phase).
SparkSession
Batch ID
OffsetSeqMetadata
// query is an already-started streaming query (StreamingQuery)
// checkpointLocation is the checkpoint root directory the query was started with
import org.apache.spark.sql.execution.streaming.StreamingQueryWrapper
val stateCheckpointDir = query
.asInstanceOf[StreamingQueryWrapper]
.streamingQuery
.lastExecution
.checkpointLocation
val stateDir = s"$checkpointLocation/state"
assert(stateCheckpointDir equals stateDir)
numStateStores: Int

numStateStores is the number of state stores that is used for the state info of the next stateful operator (when requested to optimize a streaming physical plan
using the state preparation rule that creates the stateful physical operators:
StateStoreSaveExec, StateStoreRestoreExec, StreamingDeduplicateExec,
FlatMapGroupsWithStateExec, StreamingSymmetricHashJoinExec, and
StreamingGlobalLimitExec).
StreamingJoinStrategy
StatefulAggregationStrategy
FlatMapGroupsWithStateStrategy
StreamingRelationStrategy
StreamingDeduplicationStrategy
StreamingGlobalLimitStrategy
state: Rule[SparkPlan]
StreamingDeduplicateExec
FlatMapGroupsWithStateExec
StreamingSymmetricHashJoinExec
StreamingGlobalLimitExec
state simply transforms the physical plan with the above physical operators and fills out their execution-specific configuration (e.g. the OutputMode).
state rule is used (as part of the physical query optimizations) when IncrementalExecution
is requested to optimize (prepare) the physical plan of the streaming query (once for
ContinuousExecution and every trigger for MicroBatchExecution in their queryPlanning
phases).
Tip Read up on Physical Query Optimizations in The Internals of Spark SQL book.
nextStatefulOperationStateInfo(): StatefulOperatorStateInfo

nextStatefulOperationStateInfo creates a new StatefulOperatorStateInfo with the state checkpoint location, the run ID (of the streaming query), the next statefulOperator ID, the current batch ID, and the number of state stores.
Note: All the other properties (the state checkpoint location, the run ID, the current batch ID, and the number of state stores) are the same within a single IncrementalExecution instance. The only two properties that may ever change are the run ID (after a streaming query is restarted from the checkpoint) and the current batch ID (every micro-batch in MicroBatchExecution execution engine).
shouldRunAnotherBatch is positive (true) when there is at least one StateStoreWriter physical operator (in the executedPlan physical query plan) that requires another non-data batch (per the given OffsetSeqMetadata with the event-time watermark and the batch timestamp).
assert(spark.sessionState.conf.numShufflePartitions == 1)
// END: Only for easier debugging
// Start the query to access lastExecution that has the checkpoint resolved
import scala.concurrent.duration._
import org.apache.spark.sql.streaming.{OutputMode, Trigger}
val t = Trigger.ProcessingTime(1.hour) // should be enough time for exploration
val sq = counts
.writeStream
.format("console")
.option("truncate", false)
.option("checkpointLocation", "/tmp/spark-streams-state-checkpoint-root")
.trigger(t)
.outputMode(OutputMode.Complete)
.start
// wait till the first batch which should happen right after start
import org.apache.spark.sql.execution.streaming._
val lastExecution = sq.asInstanceOf[StreamingQueryWrapper].streamingQuery.lastExecution
scala> println(lastExecution.checkpointLocation)
file:/tmp/spark-streams-state-checkpoint-root/state
StreamingQueryListenerBus
SparkSession ).
LiveListenerBus
activeQueryRunIds: HashSet[UUID]
StreamingQueryListener (so the events get sent out to streaming queries in the SparkSession).
post simply posts the input event directly to the LiveListenerBus unless it is a
QueryStartedEvent.
For a QueryStartedEvent, post adds the runId (of the streaming query that has been
started) to the activeQueryRunIds internal registry first, posts the event to the
LiveListenerBus and then postToAll.
doPostEvent Method
doPostEvent(
listener: StreamingQueryListener,
event: StreamingQueryListener.Event): Unit
For any other event, doPostEvent simply does nothing (swallows it).
postToAll Method
postToAll first requests the parent ListenerBus to post the event to all registered listeners.
For a QueryTerminatedEvent, postToAll simply removes the runId (of the streaming
query that has been terminated) from the activeQueryRunIds internal registry.
StreamMetadata
StreamMetadata is a metadata associated with a StreamingQuery (indirectly through
StreamExecution).
import org.apache.spark.sql.execution.streaming.StreamMetadata
import org.apache.hadoop.fs.Path
val metadataPath = new Path("metadata")
val sm = StreamMetadata.read(metadataPath, spark.sessionState.newHadoopConf())
scala> :type sm
Option[org.apache.spark.sql.execution.streaming.StreamMetadata]
read(
metadataFile: Path,
hadoopConf: Configuration): Option[StreamMetadata]
read returns a StreamMetadata if the metadata file was available and the content could be successfully deserialized (from JSON).
write(
metadata: StreamMetadata,
metadataFile: Path,
hadoopConf: Configuration): Unit
write persists the given StreamMetadata to the given metadataFile file in JSON format.
EventTimeWatermark Unary Logical Operator
When requested for the output attributes, EventTimeWatermark logical operator goes over the
output attributes of the child logical operator to find the matching attribute based on the
eventTime attribute and updates it to include spark.watermarkDelayMs metadata key with the
watermark delay interval (converted to milliseconds).
val q = logs.
withWatermark(eventTime = "timestamp", delayThreshold = "30 seconds") // <-- creates EventTimeWatermark
scala> println(q.queryExecution.logical.numberedTreeString) // <-- no EventTimeWatermark as it was removed immediately
00 Relation[value#0] text
output: Seq[Attribute]
output finds eventTime column in the output schema of the child logical operator and
updates the Metadata of the column with spark.watermarkDelayMs key and the
milliseconds for the delay.
// FIXME How to access/show the eventTime column with the metadata updated to include spark.watermarkDelayMs?
import org.apache.spark.sql.catalyst.plans.logical.EventTimeWatermark
val etw = q.queryExecution.logical.asInstanceOf[EventTimeWatermark]
scala> etw.output.toStructType.printTreeString
root
|-- timestamp: timestamp (nullable = true)
|-- value: long (nullable = true)
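One way to approach the FIXME above is to print the per-attribute Metadata directly. This is a hedged sketch that reuses the etw value from the snippet above:

// Print the metadata of every output attribute of the EventTimeWatermark operator;
// the event-time column (timestamp) should carry the spark.watermarkDelayMs key
// with 30000 (30 seconds in milliseconds).
etw.output.foreach { attr =>
  println(s"${attr.name}: ${attr.metadata.json}")
}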
getDelayMs(
delay: CalendarInterval): Long
getDelayMs …FIXME
FlatMapGroupsWithState Unary Logical Operator
KeyValueGroupedDataset.mapGroupsWithState
KeyValueGroupedDataset.flatMapGroupsWithState
planning strategy)
Grouping attributes
Data attributes
State ExpressionEncoder
Output mode
GroupStateTimeout
Deduplicate Unary Logical Operator
Deduplicate is a unary logical operator that represents the dropDuplicates operator (that drops duplicate records for a given subset of columns).
The following code is not supported in Structured Streaming and results in an AnalysisException:
Flag whether the logical operator is for streaming (enabled) or batch (disabled) mode
MemoryPlan Logical Query Plan
MemoryPlan is a leaf logical operator (i.e. LogicalPlan ) that is used to query the data that
has been written into a MemorySink. MemoryPlan is created when starting continuous writing
(to a MemorySink ).
scala> intsOut.explain(true)
== Parsed Logical Plan ==
SubqueryAlias memstream
+- MemoryPlan org.apache.spark.sql.execution.streaming.MemorySink@481bf251, [value#21]
== Physical Plan ==
LocalTableScan [value#21]
StreamingRelation Leaf Logical Operator for Streaming Source
// rate is a streaming DataFrame, e.g. spark.readStream.format("rate").load
import org.apache.spark.sql.execution.streaming.StreamingRelation
val relation = rate.queryExecution.logical.asInstanceOf[StreamingRelation]
scala> relation.isStreaming
res1: Boolean = true
scala> println(relation)
rate
apply creates a StreamingRelation for the given DataSource (that represents a streaming
source).
DataSource
StreamingRelationV2 Leaf Logical Operator
Tip Read up on Leaf logical operators in The Internals of Spark SQL book.
ContinuousMemoryStream is created
scala> :type sq
org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.streaming.StreamingRelationV2
val relation = sq.queryExecution.logical.asInstanceOf[StreamingRelationV2]
assert(relation.isStreaming)
DataSourceV2
Optional StreamingRelation
SparkSession
StreamingExecutionRelation Leaf Logical Operator for Streaming Source At Execution
Streaming source
Output attributes
apply creates a StreamingExecutionRelation for the input source and with the attributes
EventTimeWatermarkExec
The purpose of the EventTimeWatermarkExec operator is to simply extract (project) the values
of the event-time watermark column and add them directly to the EventTimeStatsAccum
internal accumulator.
Note: Since the execution (data processing) happens on Spark executors, the only way to establish communication between the tasks (on the executors) and the driver is to use an accumulator. Read up on Accumulators in The Internals of Apache Spark book.
EventTimeWatermarkExec uses an EventTimeStatsAccum accumulator to collect the statistics (the maximum, minimum, average and update count) of the values in the event-time watermark column that are later used in:
ProgressReporter for creating execution statistics for the most recent query execution
(for monitoring the max , min , avg , and watermark event-time watermark statistics)
Event time column - the column with the (event) time for event-time watermark
doExecute(): RDD[InternalRow]
Internally, doExecute executes the child physical operator and maps over the partitions
(using RDD.mapPartitions ).
doExecute creates an unsafe projection (one per partition) for the column with the event
time in the output schema of the child physical operator. The unsafe projection is to extract
event times from the (stream of) internal rows of the child physical operator.
For every row ( InternalRow ) per partition, doExecute requests the eventTimeStats
accumulator to add the event time.
Note: The event time value is in seconds (not millis, as the value is divided by 1000).
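The following standalone sketch (not the operator's code) illustrates the accumulator pattern described above with public APIs only: tasks add per-row values on the executors and the driver reads the result afterwards. Note that a plain LongAccumulator only sums, whereas EventTimeStatsAccum tracks max, min, average and count.

import spark.implicits._

// A driver-side accumulator that executors add to while processing rows
val eventTimeSum = spark.sparkContext.longAccumulator("eventTimeSecondsSum")

val eventTimesSec = Seq(1561939200L, 1561939260L, 1561939320L).toDS  // epoch seconds
eventTimesSec.foreach(sec => eventTimeSum.add(sec))   // runs on executors

println(eventTimeSum.value)   // read on the driver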
output: Seq[Attribute]
output requests the child physical operator for the output attributes to find the event time
column and any other column with metadata that contains spark.watermarkDelayMs key.
For the event time column, output updates the metadata to include the delay interval for
the spark.watermarkDelayMs key.
For any other column (not the event time column) with the spark.watermarkDelayMs key,
output simply removes the key from the metadata.
Internal Properties
Name Description
Delay interval - the delay interval in milliseconds
Used when:
FlatMapGroupsWithStateExec
Refer to Logging.
User-defined state function that is applied to every group (of type (Any,
Iterator[Any], LogicalGroupState[Any]) ⇒ Iterator[Any] )
Data attributes
Output object attribute (that is the reference to the single object field this operator
outputs)
StatefulOperatorStateInfo
OutputMode
GroupStateTimeout
Event-time watermark
FlatMapGroupsWithStateExec as StateStoreWriter

FlatMapGroupsWithStateExec is a stateful physical operator that can write to a state store (and uses a GroupStateTimeout).

FlatMapGroupsWithStateExec uses the GroupStateTimeout (and the updated metadata) when asked whether to run another batch or not (when MicroBatchExecution is requested to construct the next streaming micro-batch when requested to run the activated streaming query).

FlatMapGroupsWithStateExec may also be given an optional event-time watermark.
The event-time watermark is initially undefined ( None ) when planned for execution (in the FlatMapGroupsWithStateStrategy execution planning strategy).
Note: The preparations rules are executed (applied to a physical query plan) at the executedPlan phase of Structured Query Execution Pipeline to generate an optimized physical query plan ready for execution. Read up on Structured Query Execution Pipeline in The Internals of Spark SQL book.

Note: The optional event-time watermark can only be defined when the state preparation rule is executed which is at the executedPlan phase of Structured Query Execution Pipeline which is also part of the queryPlanning phase.
stateManager: StateManager
A StateManager is created per state format version that is given while creating a
FlatMapGroupsWithStateExec (to choose between the available implementations).
All state data (for all keys) in a StateStore while processing timed-out state data
Removing the state for a key from a StateStore when all rows have been processed
Persisting the state for a key in a StateStore when all rows have been processed
keyExpressions Method
keyExpressions: Seq[Attribute]
doExecute(): RDD[InternalRow]
doExecute then requests the child physical operator to execute and generate an
RDD[InternalRow] .
2. (only when the GroupStateTimeout is EventTimeTimeout) Filters out late data based on
the event-time watermark, i.e. rows from a given Iterator[InternalRow] that are older
than the event-time watermark are excluded from the steps that follow
3. Requests the InputProcessor to create an iterator of a new data processed from the
(possibly filtered) iterator
5. Creates an iterator by concatenating the above iterators (with the new data processed
first)
Internal Properties
Name Description
stateAttributes
stateDeserializer
stateSerializer
timestampTimeoutAttribute
StateStoreRestoreExec
StateStoreRestoreExec is a unary physical operator that restores (reads) a streaming state from a state store (for the keys from the child physical operator).
StateStoreRestoreExec and StreamingAggregationStateManager — stateManager Property
stateManager: StreamingAggregationStateManager

stateManager is the StreamingAggregationStateManager of StateStoreRestoreExec.
The StreamingAggregationStateManager is created for the keys, the output schema of the
child physical operator and the version of the state format.
Extracting the columns for the key from the input row
doExecute(): RDD[InternalRow]
Internally, doExecute executes child physical operator and creates a StateStoreRDD with
storeUpdateFunction that does the following per child operator’s RDD partition:
1. Generates an unsafe projection to access the key field (using keyExpressions and the
output schema of child operator).
Extracts the key from the row (using the unsafe projection above)
Gets the saved state in StateStore for the key if available (it might not be if the
key appeared in the input the first time)
Increments numOutputRows metric (that in the end is the number of rows from the
child operator)
Generates collection made up of the current row and possibly the state for the key
if available
StateStoreSaveExec
IncrementalExecution is requested to prepare the logical plan (of a streaming query) for execution (when the state preparation rule is executed).
When executed, StateStoreSaveExec executes the child physical operator and creates a
StateStoreRDD (with storeUpdateFunction specific to the output mode).
keyExpressions, the output of the child physical operator and the stateFormatVersion).
Refer to Logging.
The following table shows how the performance metrics are computed (and so their exact
meaning).
number of output rows (equivalent to the number of total state rows metric)
For Update output mode, the number of rows that the StreamingAggregationStateManager was requested to store in a state store (that did not expire per the optional watermarkPredicateForData predicate), which is equivalent to the number of updated state rows metric
doExecute(): RDD[InternalRow]
Note: doExecute requires that the optional outputMode is at this point defined (that should have happened when IncrementalExecution had prepared a streaming aggregation for execution).

doExecute executes the child physical operator and creates a StateStoreRDD with a storeUpdateFunction that:

1. Generates an unsafe projection to access the key field (using keyExpressions and the output schema of child).

2. Branches off per the output mode:
Note Append is the default output mode when not specified explicitly.
1. Finds late (aggregate) rows from child physical operator (that have expired per
watermark)
2. Stores the late rows in the state store and increments the numUpdatedStateRows
metric
3. Gets all the added (late) rows from the state store
4. Creates an iterator that removes the late rows from the state store when requested the
next row and in the end commits the state updates
3. Takes all the rows from StateStore and returns a NextIterator that:
In close , records the time to iterate over all the rows in allRemovalsTimeMs
metric, commits the updates to StateStore followed by recording the time in
commitTimeMs metric and recording StateStore metrics.
2. Stores the rows by key in the state store eagerly (i.e. all rows that are available in the
parent iterator before proceeding)
4. In the end, reads the key-row pairs from the state store and passes the rows along (i.e.
to the following physical operator)
The number of keys stored in the state store is recorded in numUpdatedStateRows metric.
2. While storing the rows, increments numUpdatedStateRows metric (for every row) and
records the total time in allUpdatesTimeMs metric.
4. Commits the state updates to StateStore and records the time in commitTimeMs
metric.
6. In the end, takes all the rows stored in StateStore and increments numOutputRows
metric.
With no more rows available, that removes the late rows from the state store (all at once)
and commits the state updates.
1. Records the total time to iterate over all the rows in allUpdatesTimeMs metric.
3. Commits the updates to StateStore and records the time in commitTimeMs metric.
shouldRunAnotherBatch(
newMetadata: OffsetSeqMetadata): Boolean
Event-time watermark is defined and is older (below) the current event-time watermark
(of the given OffsetSeqMetadata )
StreamingDeduplicateExec
scala> println(uniqueValues.queryExecution.logical.numberedTreeString)
00 Deduplicate [value#214L], true
01 +- StreamingRelation DataSource(org.apache.spark.sql.SparkSession@4785f176,rate,List
(),None,List(),None,Map(),None), rate, [timestamp#213, value#214L]
scala> uniqueValues.explain
== Physical Plan ==
StreamingDeduplicate [value#214L], StatefulOperatorStateInfo(<unknown>,5a65879c-67bc-4
e77-b417-6100db6a52a2,0,0), 0
+- Exchange hashpartitioning(value#214L, 200)
+- StreamingRelation rate, [timestamp#213, value#214L]
val sq = uniqueValues.
writeStream.
format("console").
option("truncate", false).
trigger(Trigger.ProcessingTime(10.seconds)).
outputMode(OutputMode.Update).
start
-------------------------------------------
Batch: 0
-------------------------------------------
+---------+-----+
|timestamp|value|
+---------+-----+
+---------+-----+
-------------------------------------------
Batch: 1
-------------------------------------------
+-----------------------+-----+
|timestamp |value|
+-----------------------+-----+
|2017-07-25 22:12:03.018|0 |
|2017-07-25 22:12:08.018|5 |
|2017-07-25 22:12:04.018|1 |
|2017-07-25 22:12:06.018|3 |
|2017-07-25 22:12:05.018|2 |
|2017-07-25 22:12:07.018|4 |
+-----------------------+-----+
-------------------------------------------
Batch: 2
-------------------------------------------
+-----------------------+-----+
|timestamp |value|
+-----------------------+-----+
|2017-07-25 22:12:10.018|7 |
|2017-07-25 22:12:09.018|6 |
|2017-07-25 22:12:12.018|9 |
|2017-07-25 22:12:13.018|10 |
|2017-07-25 22:12:15.018|12 |
|2017-07-25 22:12:11.018|8 |
|2017-07-25 22:12:14.018|11 |
|2017-07-25 22:12:16.018|13 |
|2017-07-25 22:12:17.018|14 |
|2017-07-25 22:12:18.018|15 |
+-----------------------+-----+
// Eventually...
sq.stop
/**
// Start spark-shell with debugging and Kafka support
SPARK_SUBMIT_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005" \
./bin/spark-shell \
--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.0-SNAPSHOT
*/
// Reading
val topic1 = spark.
readStream.
format("kafka").
option("subscribe", "topic1").
option("kafka.bootstrap.servers", "localhost:9092").
option("startingOffsets", "earliest").
load
// records: topic1 deduplicated on value, with key and value cast to strings (definition elided)
scala> records.explain
== Physical Plan ==
*Project [cast(key#0 as string) AS key#249, cast(value#1 as string) AS value#250]
+- StreamingDeduplicate [value#1], StatefulOperatorStateInfo(<unknown>,68198b93-6184-49
ae-8098-006c32cc6192,0,0), 0
+- Exchange hashpartitioning(value#1, 200)
+- *Project [key#0, value#1]
+- StreamingRelation kafka, [key#0, value#1, topic#2, partition#3, offset#4L,
timestamp#5, timestampType#6]
// Writing
import org.apache.spark.sql.streaming.{OutputMode, Trigger}
import scala.concurrent.duration._
val sq = records.
writeStream.
format("console").
option("truncate", false).
trigger(Trigger.ProcessingTime(10.seconds)).
queryName("from-kafka-topic1-to-console").
outputMode(OutputMode.Update).
start
// Eventually...
sq.stop
Refer to Logging.
doExecute(): RDD[InternalRow]

doExecute executes the child physical operator and creates a StateStoreRDD with a storeUpdateFunction that:
1. Generates an unsafe projection to access the key field (using keyExpressions and the
output schema of child).
Extracts the key from the row (using the unsafe projection above)
(when there was a state for the key in the row) Filters out (aka drops) the row
(when there was no state for the key in the row) Stores a new (and empty) state for
the key and increments numUpdatedStateRows and numOutputRows metrics.
Updates allRemovalsTimeMs metric with the time taken to remove keys older than
the watermark from the StateStore
Updates commitTimeMs metric with the time taken to commit the changes to the
StateStore
StatefulOperatorStateInfo
Event-time watermark
shouldRunAnotherBatch …FIXME
StreamingGlobalLimitExec
StreamingGlobalLimitExec is created when the StreamingGlobalLimitStrategy execution planning strategy is requested to plan a Limit logical operator (in the logical plan of a streaming query) for execution.
The optional properties, i.e. the StatefulOperatorStateInfo and the output mode, are initially
undefined when StreamingGlobalLimitExec is created. StreamingGlobalLimitExec is updated
to hold execution-specific configuration when IncrementalExecution is requested to prepare
the logical plan (of a streaming query) for execution (when the state preparation rule is
executed).
Streaming Limit
StreamingGlobalLimitExec as StateStoreWriter
StreamingGlobalLimitExec is a stateful physical operator that can write to a state store.
Performance Metrics
StreamingGlobalLimitExec uses the performance metrics of the parent StateStoreWriter.
doExecute(): RDD[InternalRow]
doExecute …FIXME
Internal Properties
Name Description
keySchema FIXME. Used when…FIXME
valueSchema FIXME. Used when…FIXME
StreamingRelationExec
scala> spark.version
res0: String = 2.3.0-SNAPSHOT
Output attributes
StreamingSymmetricHashJoinExec Binary Physical Operator — Stream-Stream Joins
StreamingSymmetricHashJoinExec is a binary physical operator that represents a stream-stream join (with the left and the right keys using the exact same data types).

StreamingSymmetricHashJoinExec is created exclusively when the StreamingJoinStrategy execution planning strategy is requested to plan a logical query plan with a Join logical operator of two streaming queries with equality predicates ( EqualTo and EqualNullSafe ).
StreamingSymmetricHashJoinExec uses two OneSideHashJoiners (for the left and right sides
of the join) to manage join state when processing partitions of the left and right sides of a
stream-stream join.
Join type
StatefulOperatorStateInfo
Event-Time Watermark
output: Seq[Attribute]
For Cross and Inner ( InnerLike ) joins, it is the output schema of the left and right
operators
For LeftOuter joins, it is the output schema of the left operator with the attributes of the
right operator with nullability flag enabled ( true )
For RightOuter joins, it is the output schema of the right operator with the attributes of
the left operator with nullability flag enabled ( true )
outputPartitioning: Partitioning
eventTimeWatermark: Option[Long]

eventTimeWatermark is an optional event-time watermark that gets defined when IncrementalExecution
was requested to apply the state preparation rule to a physical query plan of a streaming
query (to optimize (prepare) the physical plan of the streaming query once for
ContinuousExecution and every trigger for MicroBatchExecution in their queryPlanning
phases).
stateWatermarkPredicates: JoinStateWatermarkPredicates
requiredChildDistribution: Seq[Distribution]
number of updated state rows: Number of updated state rows of the left and right OneSideHashJoiners
shouldRunAnotherBatch(
newMetadata: OffsetSeqMetadata): Boolean
Either the left or right join state watermark predicates are defined (in the
JoinStateWatermarkPredicates)
doExecute(): RDD[InternalRow]
doExecute then uses SymmetricHashJoinStateManager utility to get the names of the state
stores for the left and right sides of the streaming join.
In the end, doExecute requests the left and right child physical operators to execute
(generate an RDD) and then stateStoreAwareZipPartitions with processPartitions (and with
the StateStoreCoordinatorRef and the state stores).
processPartitions(
leftInputIter: Iterator[InternalRow],
rightInputIter: Iterator[InternalRow]): Iterator[InternalRow]
processPartitions records the current time (as updateStartTimeNs for the total time to update rows performance metric).
processPartitions creates a OneSideHashJoiner for the LeftSide and all other properties
processPartitions creates a OneSideHashJoiner for the RightSide and all other properties
processPartitions then requests the OneSideHashJoiner for the left-hand join side to storeAndJoinWithOtherSide with the right-hand side one (that creates a leftOutputIter row iterator) and the OneSideHashJoiner for the right-hand join side to do the same with the left-hand side one (and creates a rightOutputIter row iterator).

processPartitions records the current time (as innerOutputCompletionTimeNs for the total time for the inner join to complete).
processPartitions creates a CompletionIterator with the left and right output iterators
(with the rows of the leftOutputIter first followed by rightOutputIter ). When no rows are
left to process, the CompletionIterator records the completion time.
For Inner joins, processPartitions simply uses the output iterator of the left and right
rows
processPartitions creates an UnsafeProjection for the output (and the output of the left and right child operators) that counts all the rows of the join-specific output iterator (as the numOutputRows metric) and generates an output projection.

In the end, processPartitions returns a CompletionIterator with the output iterator with the rows counted (as the numOutputRows metric) and the onOutputCompletion completion function.
onOutputCompletion: Unit
onOutputCompletion calculates the total time to update rows performance metric (that is the
onOutputCompletion adds the time for the inner join to complete (since
onOutputCompletion records the time to remove old state (per the join state watermark
predicate for the left and the right streaming queries) and adds it to the total time to remove
rows performance metric.
Note: onOutputCompletion triggers the old state removal eagerly by iterating over the state rows to be deleted.
onOutputCompletion records the time for the left and right OneSideHashJoiners to commit
any state changes that becomes the time to commit changes performance metric.
onOutputCompletion calculates the number of updated state rows performance metric (as
the number of updated state rows of the left and right streaming queries).
onOutputCompletion calculates the number of total state rows performance metric (as the
sum of the number of keys in the KeyWithIndexToValueStore of the left and right streaming
queries).
onOutputCompletion calculates the memory used by state performance metric (as the sum of the memory used by the state of the left and right streaming queries).
Internal Properties
Name Description
joinStateManager SymmetricHashJoinStateManager. Used when OneSideHashJoiner is requested to storeAndJoinWithOtherSide, removeOldState, commitStateAndGetMetrics, and for the values for a given key
storeConf StateStoreConf. Used exclusively to create a SymmetricHashJoinStateManager
FlatMapGroupsWithStateStrategy Execution Planning Strategy for FlatMapGroupsWithState Logical Operator
FlatMapGroupsWithStateStrategy is an execution planning strategy that can plan streaming queries with FlatMapGroupsWithState logical operators (as FlatMapGroupsWithStateExec physical operators).
Tip Read up on Execution Planning Strategies in The Internals of Spark SQL book.
import org.apache.spark.sql.streaming.GroupState
val stateFunc = (key: Long, values: Iterator[(Timestamp, Long)], state: GroupState[Long]) => {
Iterator((key, values.size))
}
import java.sql.Timestamp
import org.apache.spark.sql.streaming.{GroupStateTimeout, OutputMode}
val numGroups = spark.
readStream.
format("rate").
load.
as[(Timestamp, Long)].
groupByKey { case (time, value) => value % 2 }.
flatMapGroupsWithState(OutputMode.Update, GroupStateTimeout.NoTimeout)(stateFunc)
scala> numGroups.explain(true)
== Parsed Logical Plan ==
'SerializeFromObject [assertnotnull(assertnotnull(input[0, scala.Tuple2, true]))._1 AS
_1#267L, assertnotnull(assertnotnull(input[0, scala.Tuple2, true]))._2 AS _2#268]
+- 'FlatMapGroupsWithState <function3>, unresolveddeserializer(upcast(getcolumnbyordin
al(0, LongType), LongType, - root class: "scala.Long"), value#262L), unresolveddeseria
lizer(newInstance(class scala.Tuple2), timestamp#253, value#254L), [value#262L], [time
stamp#253, value#254L], obj#266: scala.Tuple2, class[value[0]: bigint], Update, false,
NoTimeout
+- AppendColumns <function1>, class scala.Tuple2, [StructField(_1,TimestampType,tru
e), StructField(_2,LongType,false)], newInstance(class scala.Tuple2), [input[0, bigint
, false] AS value#262L]
+- StreamingRelation DataSource(org.apache.spark.sql.SparkSession@38bcac50,rate,
List(),None,List(),None,Map(),None), rate, [timestamp#253, value#254L]
...
== Physical Plan ==
*SerializeFromObject [assertnotnull(input[0, scala.Tuple2, true])._1 AS _1#267L, asser
tnotnull(input[0, scala.Tuple2, true])._2 AS _2#268]
+- FlatMapGroupsWithState <function3>, value#262: bigint, newInstance(class scala.Tupl
e2), [value#262L], [timestamp#253, value#254L], obj#266: scala.Tuple2, StatefulOperato
rStateInfo(<unknown>,84b5dccb-3fa6-4343-a99c-6fa5490c9b33,0,0), class[value[0]: bigint
], Update, NoTimeout, 0, 0
+- *Sort [value#262L ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(value#262L, 200)
+- AppendColumns <function1>, newInstance(class scala.Tuple2), [input[0, bigi
nt, false] AS value#262L]
+- StreamingRelation rate, [timestamp#253, value#254L]
StatefulAggregationStrategy Execution Planning Strategy — EventTimeWatermark and Aggregate Logical Operators
StatefulAggregationStrategy is an execution planning strategy that is used to plan streaming queries with EventTimeWatermark and Aggregate logical operators.
Tip Read up on Execution Planning Strategies in The Internals of Spark SQL book.
spark.sessionState.planner.StatefulAggregationStrategy
Aggregate
1. HashAggregateExec
2. ObjectHashAggregateExec
3. SortAggregateExec
// Eventually...
consoleOutput.stop
planStreamingAggregation(
groupingExpressions: Seq[NamedExpression],
functionsWithoutDistinct: Seq[AggregateExpression],
resultExpressions: Seq[NamedExpression],
child: SparkPlan): Seq[SparkPlan]
partialAggregate ) with:
initialInputBufferOffset as 0
Note: The aggregate physical operator is one of HashAggregateExec, ObjectHashAggregateExec, and SortAggregateExec.
with:
with:
In the end, planStreamingAggregation creates the final aggregate physical operator (called
finalAndCompleteAggregate ) with:
StreamingDeduplicationStrategy Execution Planning Strategy for Deduplicate Logical Operator
StreamingDeduplicationStrategy is an execution planning strategy that can plan streaming queries with Deduplicate logical operators (as StreamingDeduplicateExec physical operators).
Tip Read up on Execution Planning Strategies in The Internals of Spark SQL book.
spark.sessionState.planner.StreamingDeduplicationStrategy
FIXME
StreamingGlobalLimitStrategy Execution Planning Strategy
StreamingGlobalLimitStrategy is an execution planning strategy that can plan streaming
queries with ReturnAnswer and Limit logical operators (over streaming queries) with the
Append output mode to StreamingGlobalLimitExec physical operator.
Tip Read up on Execution Planning Strategies in The Internals of Spark SQL book.
FIXME
StreamingJoinStrategy
StreamingJoinStrategy is an execution planning strategy that plans Join logical operators of two streaming queries (as StreamingSymmetricHashJoinExec physical operators) in a streaming query.
log4j.logger.org.apache.spark.sql.catalyst.planning.ExtractEquiJoinKeys=ALL
Refer to Logging.
StreamingRelationStrategy
spark.sessionState.planner.StreamingRelationStrategy
UnsupportedOperationChecker
UnsupportedOperationChecker checks whether the logical plan of a streaming query uses supported operations only.
checkForStreaming Method
checkForStreaming(
plan: LogicalPlan,
outputMode: OutputMode): Unit
2. Streaming aggregation with Append output mode requires watermark (on the grouping
expressions)
3. Multiple flatMapGroupsWithState operators are only allowed with Append output mode
checkForStreaming …FIXME
checkForStreaming finds all streaming aggregates (i.e. Aggregate logical operators with
streaming sources).
query.
checkForStreaming asserts that a watermark was defined for a streaming aggregation with Append output mode.
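The Append-mode rule can be seen with public APIs only. The following sketch (source and sink are arbitrary choices) starts a streaming aggregation in Append output mode without a watermark and prints the resulting AnalysisException message:

import org.apache.spark.sql.AnalysisException
import org.apache.spark.sql.streaming.OutputMode

// A streaming aggregation without a watermark on the grouping expressions
val counts = spark.readStream.format("rate").load.groupBy("value").count

try {
  counts.writeStream
    .format("console")
    .outputMode(OutputMode.Append)  // Append requires a watermark for streaming aggregations
    .start
} catch {
  case e: AnalysisException => println(e.getMessage)
}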
Caution FIXME
used when:
Caution FIXME